Serious Autonomous Vehicles



ads ros and lgsvl talk in dockers

Posted on 2020-01-14 |

background

previously, we integrated the ads ros nodes into one launch file. now we try to put the ads nodes into one docker container, and talk to the lgsvl simulator in another docker container.

run docker image

```shell
docker build -t ads_ros .
docker run -it ads_ros /bin/bash
```

unable to run catkin_make in dockerfile

catkin_make is only on the PATH after sourcing the ROS setup script, and each dockerfile RUN starts a fresh shell, so the setup script has to be sourced inside the same RUN command:

```shell
RUN /bin/bash -c '. /opt/ros/kinetic/setup.bash; cd <into the desired folder e.g. ~/catkin_ws/src>; catkin_make'
```

dockerfile for ads ros

```dockerfile
FROM ros:kinetic
# create local catkin workspace
ENV CATKIN_WS=/root/catkin_ws
ENV ROS_DISTRO=kinetic
RUN mkdir -p $CATKIN_WS/src
## install catkin_make
## https://docs.ros.org/api/catkin/html/user_guide/installation.html
RUN APT_INSTALL="apt-get install -y --no-install-recommends" && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive $APT_INSTALL \
    build-essential \
    apt-utils \
    ca-certificates \
    psmisc \
    cmake \
    vim \
    python-catkin-pkg \
    ros-${ROS_DISTRO}-catkin \
    ros-${ROS_DISTRO}-tf \
    ros-${ROS_DISTRO}-turtlesim \
    ros-${ROS_DISTRO}-rosbridge-suite \
    iputils-ping \
    net-tools
### add third-party headers
# RUN source ~/.bashrc
# copy ads ros into ws
COPY /src $CATKIN_WS/src
### build msgs
RUN /bin/bash -c '. /opt/ros/${ROS_DISTRO}/setup.bash; cd ${CATKIN_WS}; catkin_make --pkg pcl_msgs pb_msgs autoware_msgs mobileye_msgs ibeo_msgs nmea_msgs'
### build ros nodes
RUN /bin/bash -c '. /opt/ros/${ROS_DISTRO}/setup.bash; cd ${CATKIN_WS}; catkin_make'
# copy ros scripts
COPY /script $CATKIN_WS/scripts
# run ros shell
WORKDIR ${CATKIN_WS}/scripts
```

ros envs

set ROS_IP on all involved containers to their own IP, and set ROS_MASTER_URI to the IP of the roscore container; that avoids DNS resolution problems. to understand the ros environment variables (a minimal sketch follows the list below):

  • $ROS_ROOT : sets the location where the ROS core packages are installed

  • $ROS_MASTER_URI : a required setting that tells nodes where they can locate the master

  • $ROS_IP or $ROS_HOSTNAME : sets the declared network address of a ROS node
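
a minimal sketch of setting these when starting two containers on the same docker bridge (the 172.17.x.x IPs are assumptions, and ads_ros is the image built above):

```shell
# container 1: runs the master; its bridge IP is assumed to be 172.17.0.2
docker run -it --env ROS_IP=172.17.0.2 \
    --env ROS_MASTER_URI=http://172.17.0.2:11311 \
    ads_ros /bin/bash -c ". /opt/ros/kinetic/setup.bash; roscore"

# container 2: any other ros node, pointing at the master container
docker run -it --env ROS_IP=172.17.0.3 \
    --env ROS_MASTER_URI=http://172.17.0.2:11311 \
    ads_ros /bin/bash -c ". /opt/ros/kinetic/setup.bash; rostopic list"
```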

get docker container’s IP

docker inspect -f "{{ .NetworkSettings.Networks.<network_name>.IPAddress }}" <container_name | container_id>

tip: network_name is e.g. host, bridge, ingress, etc.

with the docker host network, the container doesn't get its own IP address allocated; the application is available on the host's IP address at the chosen port.
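
a quick check of this (a sketch; ads_ros is the image built above):

```shell
# with --net host the container reports the host's own IP addresses
sudo docker run --rm --net host ads_ros hostname -I
```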

run roscore in docker and talk to rosnodes at host

  • start roscore from the docker container as follows:

    sudo docker run -it --net host ads_ros /bin/bash
    roscore

  • start other ros nodes at host

```shell
rosnode list  ## >>> /rosout
source $ROS_PACKAGE/setup.sh
rosrun rtk_sensor rtk_sensor  ### runs successfully
```

how to understand this? once the docker container starts with the `host network`, the `roscore` running inside docker is the same as one running on the host machine!

ads ros in docker talk to lgsvl in another docker with HOST network

  • once we start the ads ros docker as follows:
```shell
sudo docker run -it --net host ads_ros /bin/bash
roscore
```
  • start lgsvl in docker:
```shell
#! /usr/bin/env bash
xhost +
sudo nvidia-docker run -it -p 8080:8080 -e DISPLAY=unix$DISPLAY --net host -v /tmp/.X11-unix:/tmp/.X11-unix lgsvlsimulator /bin/bash
```

then access the webUI on the host machine, add host 10.20.181.132 in the Clusters page, and add 10.20.181.132:9090 for the Selected Vehicles. as lgsvl is also on the host network, these two docker containers can communicate through ros well!

ads ros container talk to lgsvl container with ROS_IP

since lgsvl will run in a docker swarm env, we can't depend on the host network, so the ROS_IP env variable is required. the following test is on one host machine.

  • in host terminal
```shell
export ROS_MASTER_URI=http://192.168.0.10:11311
export ROS_HOSTNAME=192.168.0.10
export ROS_IP=192.168.0.10
roscore
```

  • in ads ros docker

```shell
sudo docker run -it \
    --env ROS_MASTER_URI=http://10.20.181.132:11311 \
    --env ROS_IP=10.20.181.132 \
    ads_ros /bin/bash
rosnode list
```
however, when starting `roscore` in the docker container, it reports:
```shell
Unable to contact my own server at [http://10.20.181.132:33818/].
This usually means that the network is not configured properly.
A common cause is that the machine cannot ping itself. Please check
for errors by running:
ping 10.20.181.132
```
checking the IP address inside the docker container with `ifconfig` reports `172.17.0.3`, which explains why the container can't talk to `10.20.181.132`: we can't assign an arbitrary IP address to a docker container. so reset inside the docker container:
```shell
export ROS_MASTER_URI=http://172.17.0.3:11311
export ROS_HOSTNAME=172.17.0.3
```

actually, the host terminal can talk to the ads ros container directly, with no need to set $ROS_HOSTNAME & $ROS_MASTER_URI specially; the same holds for another docker container on this host machine, e.g. lgsvl.

a little bit of knowledge about docker networking: each docker container does have a virtual IP, e.g. 172.17.0.1, while if the docker image runs with the host network, there is no separate container IP; the container directly shares the IP of the host machine. as multiple docker containers run on the same host machine, even without the host network they are in the same network range, so they can communicate with each other. additionally, ros_master may require setting $ROS_HOSTNAME & $ROS_MASTER_URI.
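
a quick way to check this (a sketch; the container name is a placeholder):

```shell
# list the containers sitting on the default bridge together with their 172.17.x.x addresses
docker network inspect bridge

# from inside one container, ping the other one's bridge IP
docker exec -it <ads_ros_container> ping -c 3 172.17.0.2
```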

  • start lgsvl in another docker
```shell
#! /usr/bin/env bash
xhost +
sudo nvidia-docker run -it \
    -p 8080:8080 \
    -e DISPLAY=unix$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --env ROS_MASTER_URI=http://172.17.0.3:11311 \
    --env ROS_HOSTNAME=172.17.0.3 \
    lgsvlsimulator /bin/bash
```

in summary

so far, we host ads ros in one docker container and lgsvl in another; they are on the same machine and can talk to each other. the next thing is to put ads ros and lgsvl into one image.

refer

rosdep

docker 1.10 container’s IP in LAN

listening to ROS messages in docker containers

exposing ROS containers to host machine

why you need IP address of Docker container

catkin_make not found in dockerfile

ROS ADS integrated to python talk to lgsvl

Posted on 2020-01-10 |

background

previously, we had the ADS ros package talking directly with lgsvl. the step further is to trigger the ADS ros nodes from the scenario python script.

a few requirements:

  • the ADS ros should be integrated with each scenario python script, which means that when the python script finishes, the ADS ros should exit

  • the ads ros package shell script is independent of the python script

execute shell in Python

  • to call a shell command directly with os.system(cmd) or

    subprocess.call("ls -lt", shell=True)

  • to run a shell script with a python subprocess

    proc = subprocess.Popen([""], stdout=subprocess.PIPE)

subprocess.Popen() returns the process object.

a few helpful commands to debug rosrun:

ps -aux | grep "roscore"
ps -aux | grep "rosmaster"
killall -9 roscore
killall -9 rosmaster

ros1 to python3

there is a real issue: the ADS ros package is mostly implemented with ROS1, maintained by the algorithm team, but the lgsvl scenario runs with python3, and ROS1 matches Python2. so the solution is either to upgrade the ADS ros package to ROS2, or to find a way to run ROS1 in a python3 env.

conda base env

we decided to adapt ROS1 to the python3 env. as the host machine has a conda env, first we need to disable the conda base.

conda config --set auto_activate_base false

the lgsvl scenario is run with the conda python3 env, inside which subprocess is used to run the ADS ros shell script, which in turn creates a few new gnome-terminals; these are non-login terminals. and the tricky thing here: even though auto activate base is disabled and the terminal has no (base) prefix, when checking the python path it still points to conda/bin/python, which will fail the rosbridge_server launch, a pure ros1 and python2 module.

so the ads ros shell script needs to check whether conda is available in the current terminal (conda --version); if it is, run conda deactivate, which gives the error:

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.

it looks like there is some mess-up: the conda.sh init is in ~/.bashrc, but when the new terminal is created it doesn't get configured correctly. this can be fixed:

conda --version
if [ $? == 0 ]
then
source ~/anaconda3/etc/profile.d/conda.sh
conda deactivate
fi

in this way, we run the lgsvl scenario in python3, and can still spawn python2 terminals from this python3 terminal to run the ads ros nodes.

ros scripts path is not python path

as we try to separate the python scripts from the ads ros nodes, we need some global env variable for the ros scripts path, e.g. as sketched below.
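
a minimal sketch of such a variable (ADS_ROS_SCRIPTS and the path are hypothetical names, not from the post):

```shell
# in ~/.bashrc, or in an env file sourced by both the python side and the ros shell scripts
export ADS_ROS_SCRIPTS=$HOME/catkin_ws/scripts    # hypothetical location of ros_start.sh / ros_clear.sh
```

the python side can then build the script path from os.environ["ADS_ROS_SCRIPTS"] instead of hard-coding it.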

kill all ros nodes gracefully

we need to restart/shutdown the ads ros nodes each time the python scenario script starts/stops, which takes two steps:

  • shut down the ros nodes, e.g. rosnode kill

  • close all the gnome-terminals used to run the ros nodes

the second step is tricky and needs some understanding. a terminal is a file, like /dev/tty, and files do not have a process id. the process that “owns” the terminal is usually called the controlling process, or more correctly, the process group leader. gnome-terminal runs as a single pid and creates a child process for each window and/or tab; these child processes can be retrieved by the command:

$ pgrep -P <pid_of_gnome-terminal>

Many terminals seem to mask themselves as xterm-compatible, which is reported by echo $TERM or echo $COLORTERM.

$! is the PID of the last backgrounded process.
kill -0 $PID checks whether it's still running.
$$ is the PID of the current shell.
#! /bin/bash

gnome_pid=`pgrep gnome-terminal`
subpids=`pgrep -P ${gnome_pid}`

if [ ! $1 ]; then
   echo "missing the gnome-terminal pid, exit"
   exit -1
fi

#shutdown the ros nodes 
rosnode kill -a  > /dev/null  2>&1
killall -9 rosmaster > /dev/null  2>&1

#shutdown the terminals where ros nodes hosted
for pid in $subpids
do
        if [[ $pid != $1 ]] ; then
                echo $pid
                kill $pid
        fi
done

tips: the implementation has a little problem when integrating with lgsvl: if the simulation time is short, e.g. 1 sec, ros_clear.sh can't catch the gnome-terminals opened by ros_start.sh.

find the process name using PID

$ ps aux | grep PID
$ ps -p PID -o format         
$ curpid=`ps -p $$ -o ppid=`

the output of ps aux

  • + --> is in the foreground process group

  • s --> is a session leader

summary

so far, we integrated ads ros packages into python, which then talk to lgsvl.

refer

python subprocess from Jianshu

understand linux process group

pidof command

which_term

get the info of a PID

get the pid of running terminal

ADS stack in ros talk to lgsvl

Posted on 2020-01-04 |

learning is about repeating 100 times!

this may be the third time going through ROS; it still looks fresh, but it does give a whole picture of how ROS works in ADS dev.

  • what is ros topic and message

the topic is the channel that nodes subscribe to in order to read messages, or where nodes publish those messages; the message is the data itself, defined beforehand.

rostopic echo [topic]
rostopic list
rosmsg show [message]
rosrun [package_name] [node_name]
rospack find [package_name]
rosnode info [node_name]
  • what is a publisher/subscriber

a publisher (node) publishes messages to a particular topic. publish() is asynchronous, and only does work if there are subscribers connected on that topic. publish() itself is very fast, and it does as little work as possible:

serialize the message to a buffer
push the buffer onto a queue for later processing
ros::Publisher advertise(const std::string& topic, uint32_t queue_size, bool latch = false);

the queue size defines the publisher/outgoing message queue. if you publish faster than roscpp can send the messages over the wire, roscpp will start dropping OLD messages.

ros::Subscriber subscribe(const std::string& topic, uint32_t queue_size, <callback, which may involve multiple arguments>, const ros::TransportHints& transport_hints = ros::TransportHints());

the queue_size is the incoming message/subscriber queue size that roscpp will use for your callback. if messages arrive too fast and you are unable to keep up, roscpp will start throwing away OLD messages.

in summary:

  • publish() is asynchronous.
  • when you publish, messages are pushed into a queue (A) for later processing. this queue is immediately pushed into the outgoing/publisher queue (B). PS: if no one subscribes to the topic, it ends here.
  • when a subscriber subscribes to that topic, messages are sent/pushed from the corresponding outgoing/publisher queue (B) to the incoming/subscriber queue (C) --> this is done by an internal thread.
  • when you spin/callback, the messages handled come from the incoming/subscriber queue (C).
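
the same pub/sub behaviour can be observed from the command line; a minimal sketch with a throwaway topic:

```shell
# terminal 1: publish a std_msgs/String on /chatter at 10 Hz
rostopic pub -r 10 /chatter std_msgs/String "data: 'hello'"

# terminal 2: subscribe and print whatever arrives
rostopic echo /chatter
```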
  • messages

msg files are simple text files for specifying the data structure of a message. These files are stored in the msg subdirectory of a package. the publisher and subscriber must send and receive the same type/topic of message

{"op": "publish", "topic": "/talker", "msg": {"data" : "_my_message" }}

lgsvl defines the rosBridge class, which has a Reader() and a Writer(), corresponding to Subscriber and Publisher, respectively.

rosBridge in lgsvl

rosbridge is an adapter for non-ros apps to talk with ros. it's common to package ADS software as ros nodes during the dev stage, and rosbridge is the common way to integrate a simulator with the ADS software.

the base implementation of rosBridge in lgsvl is as follows:

RosBridge.cs

ConcurrentQueue<Action> QueuedActions ;
Dictionary<string, Tuple<Func<JSONNode, object>, List<Action<object>>>> Readers ;

the topic is packaged as:

{ "op": "subscribe or publish or call_service or service_response or set_level",
"topic": {},
"type": {}
}
  • AddReader(topic, callback)

a few types/topics supported:

Detected3DObjectArray,

Detected2DObjectArray,

VehicleControlData,

Autoware.VehicleCmd,

TwistStamped,

Apollo.control_command

which is the list of ros messages that lgsvl can parse.

if(!Readers.ContainsKey(topic))
{
    Readers.Add(topic, Tuple.Create<Func<JSONNode, object>, List<Action<object>>>(msg => converter(msg), new List<Action<object>>()));
}
Readers[topic].Item2.Add(msg => callback((T)msg));

AddReader registers the subscriber side in lgsvl. in the Sensor group, there are three sensors that call AddReader:

GroundTruth3DVisualizer

VehicleControlSensor

GroundTruth2DVisualizer

which means the lgsvl server can read these three types of messages.

if there are no rendering/visual needs, only the vehicle controller (acc, brake) message is required for lgsvl.

  • AddWriter(topic)

the types/topic supported are:

ImageData, 
PointCloudData, 
Detected3DObjectData, 
Detected2DObjectData, 
SignalDataArray, 
DetectedRadarObjectData, 
CanBusData, 
GpsData, 
ImuData, 
CorrectedImuData, 
GpsOdometryData, 
ClockData

AddWriter() is a writer adapter that returns a writer/publisher for a specific type/topic. AddWriter registers the publisher side in lgsvl; in the Sensor group, the following sensors can publish messages out:

LidarSensor
SignalSensor
GpsInsSensor
GpsOdometrySensor
DepthCameraSensor
ImuSensor
SemanticCameraSensor
GroundTruth2DSensor
CanBusSensor
RadarSensor
GroundTruth3DSensor
ClockSensor
GpsSensor
ColorCameraSensor
  • AddService(topic, callback)
  • OnMessage(sender, args)
if (args.op == "publish")
{
    topic = json["topic"];
    Readers.TryGetValue(topic, out readerPair);
    var parse = readerPair.Item1;
    var readers = readerPair.Item2;
    var msg = parse(json["msg"]);
    foreach (var reader in readers)
    {
        QueuedActions.Enqueue(() => reader(msg)); // queued, executed later
    }
}

the ros subscriber uses a callback mechanism: when a message comes in, all readers that subscribe to this topic will read the message.

RosWriter.cs

the message output is in the format as:

{
"op" : "publish" ,
"topic" : Topic,
"msg": message
}

websocket in ros bridge

the implementation of the ros bridge in lgsvl is via WebSocket, which maintains a continuous communication pipeline for external ros node publishers.
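
on the ADS side, the counterpart is the standard rosbridge websocket server from rosbridge_suite (installed in the dockerfile earlier); a minimal way to bring it up, assuming the default port 9090 that the webUI entry above points at:

```shell
source /opt/ros/kinetic/setup.bash
# start the rosbridge websocket server that lgsvl connects to
roslaunch rosbridge_server rosbridge_websocket.launch port:=9090
```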

ros ads talk to lgsvl rosbridge

from the previous section, lgsvl talks to external ros nodes (the ADS stack) through rosbridge, and needs external inputs, e.g. the vehicle control command and the 2d/3d ground truth visualizers. the ADS stack ros nodes can subscribe to Gps, canbus, Lidar, Radar, GroundTruth, SemanticCamera, DepthCamera topics through the lgsvl rosbridge.

so the pipeline is simply:

lgsvl rosbridge  -->  {gps, radar, camera, lidar, groundTruth messages}  -->  ADS stack nodes

ADS stack nodes  -->  {vehicle control command message}  -->  lgsvl rosbridge

usually, the ADS stack nodes are a few related ros nodes, including RTK/GPS sensor, Lidar/Camera/Radar sensors, hdmap node, data fusion node e.t.c.

run ROS ADS stack

the following nodes are used commonly in ADS dev:

  • gps_node

    subscribe: /gps, /odom e.t.c
    
    publish: /sensor/data topics
    
  • camera_node

    subscribe: /camera/raw_data
    
    publish:  /objects_list,  /lane_object e.t.c
    
  • radar_node

    subscribe:  /radar/raw_data
    
    publish: /obstacle/velocity , /obstacle/distance
    
  • hdmap_node

    subscribe: /rtk/data, /ins/data.
    
    publish:  /lane/info,  /speed_limit e.t.c.
    
  • fusion_node

    subscribe topics: /canbus, rtk/data, /ifv_lane,  /radar/blind, /radar/corner, /esr, /velodey, /lane/info e.t.c
    
    publish topics: /object_list/, /road/info, /vehicle/status and visual related messages e.t.c.
    
  • planning_node

    subscribe topics:  /object_list,  /road/info, /vehicle/status from fusion_node
    
    publish topics: /vehicle/cmd_ctl 
    

ADS ros launch

the previous section gives a basic idea of which ROS nodes are usually needed to run the ADS stack. in road tests or simulation tests, we prefer a simple way to launch all ADS related nodes at once.

which usually means a single launch file that starts all nodes, or a shell script that starts each ros node sequentially, e.g. the sketch below.
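
a minimal sketch of the shell-script variant (the package and node names below are placeholders, not the real ads packages):

```shell
#! /usr/bin/env bash
# bring up the whole ads stack sequentially; each node goes to the background
source /opt/ros/kinetic/setup.bash
source ~/catkin_ws/devel/setup.bash

roscore &
sleep 3   # give the master time to come up

rosrun gps_pkg gps_node &          # placeholder package/node names
rosrun camera_pkg camera_node &
rosrun radar_pkg radar_node &
rosrun hdmap_pkg hdmap_node &
rosrun fusion_pkg fusion_node &
rosrun planning_pkg planning_node &

wait   # keep the script alive until the nodes exit
```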

system verification

ros is a common solution to package the ADS stack and integrate it with the simulation env during SIL system verification, as well as for road test data collection, since ROS is the cheap solution for ADS dev and test compared to CANape/Vector or dSPACE solutions.

on the other hand, due to hardware computing limitations, driver dependencies, and especially the fact that ros mostly runs on Linux, a non-realtime OS, it is not a good fit for domain controller (HAD) tests. so for the real test environment, we need to consider these issues.

another topic is how to quickly build up a user-friendly verification pipeline, which keeps modifications as small as possible but can run in both simulation verification and physical verification.

refer

ros overview

understand rosbridge: simple ROS UI

roslibjs

mobile vehicle interface

Posted on 2019-12-29 |

background

human-machine interface(HMI), in-vehicle device(IVD).

human-vehicle interface(HVI), e.g. voice control, touch screen, face detection, etc., works well for the young and passionate generation, who have clear voices, sharp fingers, and good-looking faces; all of these user-experience-improving techs are based on similar AI.

what about the elders? in the next 10-20 years this should be one of the toughest family and society issues in China, as it already is for Japan right now.

the basic point here is that HVI is not enough for the elder groups, or at least is not the only choice for elders, who don't have clear voices, good-looking faces, or sharp fingers, and may have no patience to talk/look/touch with a weak AI system. a very good alternative or additional supporting solution could be a mobility vehicle interface (MVI).

mobility vehicle interface

what kind of mobility device

mobility is a general name for all kinds of mobile devices, especially the iPhone, Android smartphones and personal care robots.

application scenarios

consider an elder who has daily transportation needs, e.g. going to the hospital, to the supermarket, to a favorite restaurant, or to a park for dinner or to sit down with some old friends.

the elders are slow in movement and talk, and the current AI based human-vehicle interface (HVI) definitely makes them feel pressured.

a better or alternative solution is to let the mobility device talk to the vehicle through an interface.

mobility vehicle interface

the mobility vehicle interface (MVI) can be based on the existing vehicle OS and mobility OS. there are plenty of existing in-vehicle OSes, e.g. CarPlay, Baidu OS, GENIVI, QNX, etc.; on the mobility OS side, the most common are iOS and Android.

most in-vehicle OSes can run the same apps as the mobility OS, e.g. google navigation maps, instant messaging, music, emergency call services, etc.

so the first solution is app2app: basically, the app running in the personal mobility device (PMD) talks to the same app running in the vehicle OS.

a PMD spends more time with its owner, so it carries more personality than the vehicle, especially as the vehicle in the future is more like a public service rather than a personal asset.

that's why the mobility vehicle interface (MVI) is a good option, especially for elders, who may not enjoy talking to AI.

beyond being easy to implement at this moment, the app2app solution has a few limitations:

  • security is mainly provided by the app supplier, which is not a unified solution, as different app suppliers have different security mechanisms.

  • as the number of apps hosted in this system grows, the adapters or interfaces that make the bridge grow too, which degrades the user experience and increases the system cost.

so a better solution is a new mobility-vehicle interface protocol, which is the only bridge between personal mobility devices and vehicles. then no matter what kind of apps, and how many, are hosted on both systems, they won't be a burden for the system anymore.

mobility vehicle interface protocol

can MaaS survive in China

Posted on 2019-12-29 |

the difference between the US and China: citizenship vs relationship

the first class cities, e.g. Beijing, Shanghai, Shenzhen, are not different from Chicago or New York in normal people's lifestyle: they share the same luxury brands, Starbucks, Texas beef steak, city public services, and the same internationally popular elements in consumer electronics, clothing, vehicles, and even the office env.

however, down to the third or fourth class cities in China and the US, there is a huge difference.

the bottom difference is citizenship(公民意识) vs relationship(关系文化). in the US and most developed countries, citizenship is common sense, no matter in small towns or big cities. in China, the residents in big cities are similar to residents in developed countries, but normal people in third class cities value relationship more than citizenship, so the rules of how to live a high-quality/successful life in these cities are not universal. this means that if a resident from a big city jumps to these small cities, his experience of what a good career/life choice is has to change, which further has a great influence on consuming habits and the acceptance of emerging markets.

the Chinese government is pushing urbanization in most of the less citizen-minded areas; hopefully this process can be achieved within a few generations, affected by both government policy and external forces, e.g. the trade war. anyway, there is no shortcut.

Chinese sinking markets

in China, one of the most profitable businesses is e-commerce, e.g. Alibaba, JD, Pinduoduo, etc. they have been sinking to the third/fourth class cities in China in recent years, which is a special phenomenon in China; the reason, as I see it, is the division between citizenship in top class cities and relationship in most small cities in China.

for most developed countries, e.g. the US, the market is so flat that once a product/service is matured in big cities, there is no additional cost to expand to small towns nationwide. but here in China, the market, the society structure, and the residents' consuming habits are not flat, due to the division mentioned previously. so a different business strategy is needed for a product/service in big cities versus most small towns.

for the emerging market, the investors and service/product providers need input from top consulting teams, e.g. PWC, Deloitte, BCG, but the research papers from these teams tend to ignore the value gap between Chinese large cities and small cities.

of course I can understand the consulting strategy, as the emerging market is looking for new services in the near future, and it should look promising. taking mobility as a service (MaaS) as an example, the move from sharing cars to MaaS is likely to happen in urban areas in the next 10 years and expand to most areas in western European and NA countries, but in most small towns of China, it may never happen.

MaaS is a promising service if the society and the residents' values are similar (or flat). for developing countries, e.g. China and India, these emerging markets wouldn't be a great success nationwide.

start-ups in MaaS

  • mobiag
  • mobilleo
  • invers
  • maymobility
  • vaimoo in Brazil
  • populus.ai
  • staflsystems
  • polysync.io
  • geotab

public resources in mobility as a service (MaaS)

Mass alliance

International parking & mobility institute

shared mobility services in Texas

DI_Forces of change-the future of mobility

PWC_how shared mobility and automation will reolution

Princeton_strategies to Advanced automated and connected vehicles: a primer for state and local decision makers

Accenture_mobility as a service whitepaper

Bosch_HMI

Toyota_Mobility ecosystem

Volkswagen_E-mobility module

Siemens_Intelligent Traffic Systems

MaaS in UK

the tech liberation front

autonomous vehicle technology

configure hadoop in 2-nodes cluster

Posted on 2019-12-28 |

background

it's by accident that I have to jump into the data center, where 4 kinds of data need to be dealt with:

  • sensor verification, with a huge amount of special raw sensor data

  • AI perception training, with a huge amount of fused sensor data

  • synthetic scenario data, which is used for resimulation

  • intermediate status log data for Planning and Control (P&C)

big data tooling is a have-to for an L3+ ADS team; it has already been developed at top start-ups, e.g. WeRide and Pony.AI, as well as top OEMs from NA and Europe. big data, as I understand it, is at least as important to the business as to the customers, compared to AI, which is more about the customer's experience. and 2B is a trend for Internet+ diving into traditional industry. anyway, it's a good try to get some ideas about the big data ecosystem, and here is the first step: hadoop.

prepare jdk and hadoop in single node

Java sounds like a Windows language; there are a few apps that require Java on Ubuntu, e.g. the osm browser, etc., but I couldn't tell the difference between jdk and jre, or openjdk vs Oracle. jdk is a dev toolkit, which includes jre and beyond, so it's always better to set JAVA_HOME to the jdk folder.

jdk in ubuntu

there are many different versions of the jdk, e.g. 8, 9, 11, 13, etc. jdk-11 is used here, which can be downloaded from the Oracle website; there are two zip files, the src one and the pre-compiled one. the pre-compiled zip is enough for Hadoop on Ubuntu.

tar xzvf jdk-11.zip
cp -r jdk-11 /usr/local/jdk-11
cd /usr/local
ln -s jdk-11 jdk

append JAVA_HOME=/usr/local/jdk and PATH=$PATH:$JAVA_HOME/bin to ~/.bashrc (as sketched below), and test with java -version.
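
the two lines appended to ~/.bashrc would look like this (a sketch, assuming the jdk symlink created above):

```shell
# ~/.bashrc
export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:$JAVA_HOME/bin
```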

one thing to be careful about: the current login user may not be suited for a multi-node cluster env, so it's better to create a hadoop group and an hduser, and use hduser as the login user in the following steps.

create hadoop user

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
su - hduser #login as hduser

the other thing about hduser: it is not in the sudo group, which can be fixed as follows:

the current login user is hduser:

```shell
groups # hadoop
su - # but the password doesn't work
#login from the default user terminal
sudo -i
usermod -aG sudo hduser
#back to the hduser terminal
groups hduser # : hadoop sudo
exit
su - hduser #re-login as hduser
```

install and configure hadoop

hadoop installation on Ubuntu is similar to Java: there is a src zip and a pre-built zip, and I directly download the `pre-built` zip.
another thing to take care of is the hadoop version: `hadoop 2.x` has no `--daemon` option, which leads to errors when the master node runs `hadoop 3.x`.

```shell
tar xzvf hadoop-3.2.1.zip
cp -r hadoop-3.2.1 /usr/local/hadoop-3.2.1
cd /usr/local
ln -s hadoop-3.2.1 hadoop
```

add HADOOP_HOME=/usr/local/hadoop and PATH=$PATH:$HADOOP_HOME/bin to ~/.bashrc. test with hadoop version

the hadoop configuration can be found here.

there is another issue with JAVA_HOME not found, which I fix by setting the JAVA_HOME variable in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
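
a sketch of the fix, assuming the same jdk symlink as above:

```shell
# $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk
```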

passwordless access among nodes

  • generate SSH key pair
on master node:
ssh-keygen -t rsa -b 4096 -C "master"
on worker node:
ssh-keygen -t rsa -b 4096 -C "worker"

the following two steps need to be done on both machines, so that the local machine can ssh both to itself and to the remote node.

  • enable SSH access to local machine

    ssh-copy-id hduser@192.168.0.10

  • copy public key to the remote node

    ssh-copy-id hduser@192.168.0.13

tip: changing the default id_rsa name to something else doesn't work. after the changes above, a known_hosts is generated on the local machine, and an authorized_keys, which holds the public key of the ssh client, on the remote machine.

test hadoop

  • on master node
hduser@ubuntu:/usr/local/hadoop/sbin$ jps
128816 SecondaryNameNode
128563 DataNode
129156 Jps
128367 NameNode
  • on worker node:
hduser@worker:/usr/local/hadoop/logs$ jps
985 Jps
831 DataNode

and test with a mapreduce job, e.g. the sketch below.
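
a minimal smoke test with the bundled examples jar (a sketch; the jar path assumes the hadoop-3.2.1 layout installed above):

```shell
# format HDFS once, start it, then run the pi example job
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10
```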

deploy lgsvl in docker swarm 4

Posted on 2019-12-25 |

background

previously, I tried to deploy lgsvl in docker swarm, which failed due to the conflict between the host network needed to run lgsvl and the routing mesh of swarm, as I thought.

http listen on *

using --network=host is actually not a must-have; the alternative option is to use "*" as Configure.webHost, instead of localhost or a specific IP address, which otherwise leads to an HttpListener error:

The requested address is not valid in this context.

then, we can docker run lgsvl without host network limitations.

but still, if run via docker service create, it reports a failure: Error initializing Gtk+.

Gtk/UI in Unity

when starting lgsvl, it pops up the resolution window, which is a plugin of the Unity Editor implemented with gtk; as explained in the last section, this leads to the failure to run lgsvl as a service in docker swarm.

the simple solution is to disable resolution selection in Unity Editor.

Build Settings --> Player Settings --> Disable Resolution

then the popup window is bypassed.

ignore publish port

I tried to ignore the host network and run directly with the routing mesh, but it still doesn't work. then I remembered from a previous blog that when running vkcube or glxgears in docker swarm, it actually does use --network host. so it looks like the failure of running lgsvl in docker swarm is not due to the host network, but due to Gtk/gui. as we can bypass the resolution UI, directly running as follows works as expected:

sudo docker service create --name lgsvl --generic-resource "gpu=1" --replicas 2 --env DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --network host lgsvl

add assets into container

another update is to bind the assets from the host into the lgsvl image; they are stored in the sqlite data.db. this is necessary, since we bypassed the authentication and the cluster has no access to the external Internet.
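
a sketch of the bind (the unity3d path segments are placeholders, and the path must exist on every swarm node):

```shell
# bind-mount the pre-populated sqlite db into the service containers
sudo docker service create --name lgsvl --generic-resource "gpu=1" \
    --env DISPLAY --network host \
    --mount type=bind,src=/home/<user>/.config/unity3d/<company name>/<product name>/data.db,dst=/root/.config/unity3d/<company name>/<product name>/data.db \
    lgsvl
```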

where is next

in the recent two months, I have dug into docker swarm to run lgsvl. so far, the main pipeline looks to be working, but there are still a lot of little fixes left.

running AV simulation in the cloud is a necessary way to test and verify L3+ AV algorithms/products. previous ADAS tests were more about each individual feature itself, e.g. ACC, AEB, etc., all of which are easy to define a benchmark test case for, and engineers can easily define the test scenarios systematically. but for L3+, the env state space is infinite in theory, there are no benchmark test cases anymore, and the best we can do is statistical test cases, which requires a huge number of test cases; that is where virtual simulation testing in the cloud makes sense.

from a tech viewpoint, the next thing is how to drive L3+ dev with these simulation tools. another interesting topic is the data infrastructure setup.

web service bypass in lgsvl

Posted on 2019-12-15 |

background

previously I talked about the lg new version code review, where the new server-browser arch was introduced; that was focused on the lgsvl server side implementation, which is based on Nancy and a sqlite DB, plus a simple introduction to reactjs.

the gap of how the client sends an http request to the server is filled by axios. another issue is how to access asset resources across domains, namely running the lgsvl service on one host (192.168.0.10) while the http requests are sent from another remote host (192.168.0.13).

Axios

the following is an example on how Axios works.

constructor() {
    super();
    this.state = {
        user: null
    };
}
componentDidMount() {
    axios.get('https://dog.ceo/api/breeds/image/random')
        .then(response => {
            console.log(response.data);
            if (response.status == 200)
                this.setState({ user: response.data });
        })
        .catch(error => {
            console.log(error);
        });
}
render() {
    return null;  // JSX omitted in the original snippet
}

when the React component is mounted to the DOM, componentDidMount() is called, inside which axios sends a GET request to https://dog.ceo/api/breeds/image/random for a random dog photo; the response can also be stored as this component's state.

enact

enact is a React project manager; the common usage:

enact create . # generate project at current dir
npm run serve
npm run clean

an enact project has a config file, package.json, where the proxy can be specified; it is localhost by default. if you want to bind to a specific IP address, this is the right place to modify.

"enact": {
"theme": "moonstone",
"proxy": "http://192.168.0.10:5050"
},

inside lgsvl/webUI, we need to do this proxy configuration to support cross-domain access.

Nancy authentication

this.RequiresAuthentication(), which ensures that an authenticated user is available or it will return HttpStatusCode.Unauthorized. The CurrentUser must not be null and the UserName must not be empty for the user to be considered authenticated. By calling this RequiresAuthentication() method, all requests to this Module must be authenticated. if not authenticated, then the requests will be redirected to http://account.lgsimulator.com. You need to include the types in the Nancy.Security namespace in order for these extension methods to be available from inside your module.

this.RequiresAuthentication() is equal to return (this.Context.CurrentUser == null) ? new HtmlResponse(HttpStatusCode.Unauthorized) : null;

all modules in the lgsvl web server are authenticated by Nancy's RequiresAuthentication(); for test purposes only, we can bypass this function and pass the account directly:

// this.RequiresAuthentication();
// return service.List(filter, offset, count, this.Context.CurrentUser.Identity.Name)
string currentUsername = "test@abc.com";
return service.List(filter, offset, count, currentUsername)

in this way, no matter what the account in the React client is, the server always sees the http request as coming from the user test@abc.com.

sqlite db

in Linux, sqlite data.db is stored at ~/.config/unity3d/<company name>/<product name>/data.db

in Windows, data.db is stored at C:/users/username/AppData/LocalLow/<company name>/<product name>/data.db

it's interesting that when registering at lgsvlsimulator.com, the account info is actually sent back to the local db, which gives us the chance to bypass it.
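
a quick way to poke at that db (a sketch; the sqlite3 cli must be installed, and the path placeholders are as above):

```shell
# list the tables and dump the schema of the local simulator db
sqlite3 ~/.config/unity3d/<company name>/<product name>/data.db ".tables"
sqlite3 ~/.config/unity3d/<company name>/<product name>/data.db ".schema"
```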

debug webUI

chrome and firefox have react-devtools plugins, which help, but webUI doesn't use them directly. to debug webUI, it's even simpler to open dev mode in the browser; checking the few relevant sections is enough.

refer

  • bring your data to the front

deploy lgsvl in docker swarm 3

Posted on 2019-12-15 |

background

previously I tried to run Vulkan with a virtual display, which failed; as I understand it, the virtual display configuration didn't fit well with Vulkan. so this solution uses direct display and requires each node to have a monitor plugged in (which is called a PC cluster). for future cloud support, the current solution won't work. earlier, I also tried to deploy lgsvl in docker swarm, which so far can work with Vulkan as well, after a little bit of understanding of X11.

a few demo tests can be run as follows:

deploy glxgears/OpenGL in PC cluster

export DISPLAY=:0
xhost +
sudo docker service create --name glx --generic-resource "gpu=1" --constraint 'node.role==manager' --env DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --mount src="tmp",dst="/root/.Xauthority" --network host 192.168.0.10:5000/glxgears

deploy vkcube/Vulkan in PC cluster

export DISPLAY=:0
xhost +
sudo docker service create --name glx --generic-resource "gpu=1" --constraint 'node.role==manager' --env DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --mount src="tmp",dst="/root/.Xauthority" --network host 192.168.0.10:5000/vkcube

deploy service with “node.role==worker”

export DISPLAY=:0
xhost +
sudo docker service create --name glx --generic-resource "gpu=1" --constraint 'node.role==worker' --env DISPLAY --mount src="tmp",dst="/root/.Xauthority" --network host 192.168.0.10:5000/glxgears

deploy service in whole swarm

xhost +
export DISPLAY=:0
sudo docker service create --name glx --generic-resource "gpu=1" --replicas 2 --env DISPLAY --mount src="tmp",dst="/root/.Xauthority" --network host 192.168.0.10:5000/vkcube

which deploys the vkcube service on both the manager and the worker node:

overall progress: 2 out of 2 tasks 
1/2: running   [==================================================>]     
2/2: running   [==================================================>] 
verify: Service converged 

understand .Xauthority

as docker service can run with --mount arguments, the first try was to copy .Xauthority to the manager node; but .X11-unix is not copyable, since it is not a normal file but a socket.

with docker service create, when creating an OpenGL/vulkan service on one remote worker node and using $DISPLAY=:0, the display happens at the remote worker node. so in this way the remote worker node plays the Xserver role; and since the vulkan service actually runs on the remote worker node, is the remote worker node also the Xclient?

assuming the lower-level implementation of a docker swarm service is based on ssh, then when the manager node starts the service, it builds an ssh tunnel to the remote worker node with the $DISPLAY variable as null; even if docker swarm could start the ssh tunnel with -X, it would by default use the manager node's $DISPLAY=localhost:10.0.

the Xauthority cookie is used to grant access to the Xserver, so first make sure which machine is the Xserver; the Xauthority should then live on that Xserver host machine. a few tests:

ssh in worker: echo $DISPLAY --> localhost:10.0
xeyes --> display in master monitor
ssh in worker: xauth list --> {worker/unix:0 MIT-MAGIC-COOKIE-1 19282b0a651789ed27950801ef6f1441; worker/unix:10 MIT-MAGIC-COOKIE-1 a6cbe81637207bf0c168b3ad20a9267a }
in master: xauth list --> { ubuntu/unix:1 MIT-MAGIC-COOKIE-1 ee227cb9465ac073a072b9d263b4954e; ubuntu/unix:0 MIT-MAGIC-COOKIE-1 75893fb66941792235adba22362c4a6f; ubuntu/unix:10 MIT-MAGIC-COOKIE-1 785f20eb0ade772ceffb24eadeede645 }

so which cookie is for this $DISPLAY? it should be the one for ubuntu/unix:10;

ssh in worker: export DISPLAY=:0
xeyes --> display in worker monitor

then it uses the cookie for worker/unix:0.

deploy lgsvl service in swarm

xhost +
export DISPLAY=:0
sudo docker service create --name lgsvl --generic-resource "gpu=1" --replicas 2 --env DISPLAY --mount src="tmp",dst="/root/.Xauthority" --network host --publish published=8080,target=8080 192.168.0.10:5000/lgsvl

which gives:

overall progress: 0 out of 2 tasks 
1/2: container cannot be disconnected from host network or connected to host network
2/2: container cannot be disconnected from host network or connected to host network

basically, the service is deployed on the ingress network by default, but the service is also configured with the host network, so they conflict.

swarm network


the routing mesh is the default internal load balancer in the swarm network; the other choice is to deploy the service directly on the node, namely bypassing the routing mesh, which asks the service to run in global mode with the published port set as mode=host; that should be the same as --network host in replicas mode.

the limitation of bypassing the routing mesh is that there is only one task per node, and the published port can only reach the service on that specific node, which doesn't make sense in a cloud env.

docker service create \
--mode global \
--publish mode=host,target=80,published=8080 \
--generic-resource "gpu=1" \
--env DISPLAY \
--mount src="tmp",dst="/root/.Xauthority" \
--mount src="X11-unix",dst="/tmp/.X11-unix" \
--network host \
--name lgsvl \
lgsvl:latest

tips: --mount src="X11-unix",dst="/tmp/.X11-unix" is kind of cached. once the docker image has run on the worker node, it doesn't need this parameter again, but once the worker node restarts, it needs this parameter again.

in summary about the swarm network, the routing mesh should be the right solution for cloud deployment. so how do we bypass --network host?

the reason for the host network is that the lgsvl server and webUI can then work on the same host; if not, there is a kind of cross-domain security failure, which actually is another topic, namely how to host lgsvl and webUI/React on different hosts.

need some study of webUI in the next post.

where are you in next 4 years (1)

Posted on 2019-12-08 |

it's more than one year since I started the series "where are you in the next 5 years"; I would love to transfer it to "the next 4 years", and thanks to the opportunity to get back to China, there is a high chance to get heavily involved in the market in a short time.

at the beginning of the year, I travelled around the whole nation, staying in Shanghai, Beijing, Shenzhen, Guangzhou, and that was a great chance to get familiar with the startups in autonomous vehicles; this trip really gave me some input, and by now I have had another half year in one of the top OEMs in China. combining these two sources gives me kind of a whole picture of the ADS market happening in China. I would love to write this blog more in a business way of thinking, rather than an engineering way.

L4 startups

the ADS leap time was roughly 2016 to the first half of 2018. there are a bunch of startups, and most OEMs have also built their ADS teams.

the startups, e.g. Pony.ai, WeRide, AutoX, roadstar (and its new splits), TuSimple, Momenta, are still very active recently.

today I attended PlusAI's tech open day. I have to say, most of these startups have very similar tech roadmaps. I personally think that's a really sad thing.

a few techs they all have:

  • simulation pipeline

  • data pipeline(collect, label, training)

  • AI based perception, motion planning

  • friendly HMI

WeRide and Pony.ai are in robotaxi services; TuSimple and PlusAI are in highway logistics; Momenta is in harbor transportation.

Alibaba, JingDong, Meituan, etc. are in small personal package delivery shuttles, similar to Nuro.

OEMs focus on passenger vehicles.

DiDi focuses on taxi services as well, similar to Uber and Waymo.

all of them can be called ADS service suppliers. however, most of them use exactly the same sensor packages, including Lidar, Camera, Radar, GPS, etc. the software stacks during product dev, as mentioned above, are mostly similar; there may be a few special features in the deployed services, e.g. a robotaxi may have an Uber-like ride-hailing app, etc. other than that, nothing is really amazing about the ADS itself.

and mostly this is not a tech problem; it must be defined or discovered by the social guys, who start from the real needs.

in an engineering work environment, it's easy to misunderstand the role of engineering. engineering is the bumper: only when the house needs fixing is there a need for the bumper. however, in an engineering-centered env, it's so easy to see no difference between "I have the bumper" and "I have the needs". in my experience till now, I am learning how to use the bumper well, but I rarely think about why I need to learn to use the bumper.

on the other hand, what kind of tech is really helpful or profitable?

by chance, I talked with the Unity China team, who are enhancing the Unity3D engine with cloud support, unity simulation, which is a feature I have been looking for for a while. if the tech pipeline is a water flow, the Unity team is the one standing upstream, who can implement new features in the engine.

just like Nvidia, Google, etc., these are the guys who really make a difference with their tech. and it's profitable of course.
