# configure hadoop in a 2-node cluster

### background

it’s by accident that I had to jump into the data center domain, where 4 kinds of data need to be dealt with:

  • sensor verification, with a huge amount of special raw sensor data

  • AI perception training, with a huge amount of fused sensor data

  • synthetic scenario data, which is used for resimulation

  • intermediate status log data for Planning and Control (P&C)

big data tooling is a must for an L3+ ADS team; it has already been developed at top start-ups, e.g. WeRide and Pony.AI, as well as top OEMs from North America and Europe. big data, as I understand it, is at least as important to the business as to customers, compared to AI, which is more about the customer’s experience. and 2B is a trend as Internet+ players dive into traditional industries. anyway, it’s a good chance to get some ideas about the big data ecosystem, and here is the first step: hadoop.

### prepare jdk and hadoop on a single node

Java sounds like a Windows language; a few apps in Ubuntu require Java, e.g. the osm browser etc., but I couldn’t tell the difference between jdk and jre, or openjdk vs Oracle. jdk is a dev toolkit, which includes jre and beyond, so it’s always better to point JAVA_HOME to the jdk folder.

#### jdk in ubuntu

there are many different versions of jdk, e.g. 8, 9, 11, 13, etc. jdk-11 is used here, which can be downloaded from the Oracle website. there are two archives, the src one and the pre-compiled one; the pre-compiled archive is enough for Hadoop on Ubuntu.

```shell
tar xzvf jdk-11.tar.gz   # the actual archive name from Oracle will differ
sudo cp -r jdk-11 /usr/local/jdk-11
cd /usr/local
sudo ln -s jdk-11 jdk
```

append `JAVA_HOME=/usr/local/jdk` and `PATH=$PATH:$JAVA_HOME/bin` to `~/.bashrc`, then test with `java -version`.
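for reference, a minimal sketch of the two lines in `~/.bashrc` (paths assume the symlink created above):

```shell
# ~/.bashrc additions for the jdk
export JAVA_HOME=/usr/local/jdk
export PATH=$PATH:$JAVA_HOME/bin
```

reload with `source ~/.bashrc` before testing.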

one thing to be careful about here: the current login user may not fit a multi-node cluster env, so it’s better to create a hadoop group and an hduser account, and use hduser as the login user in the following steps.

#### create hadoop user

```shell
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
su - hduser   # login as hduser
```

the other catch with hduser is that it’s not in the sudo group, which can be fixed as follows (the current login user is hduser):

```shell
groups            # hadoop
su -              # but the root password doesn't work
# login from the default (sudoer) user's terminal instead
sudo -i
usermod -aG sudo hduser
# back to the hduser terminal
groups hduser     # hduser : hadoop sudo
exit
su - hduser       # re-login as hduser
```
#### install and configure hadoop
hadoop installation on Ubuntu is similar to Java’s: there is a src archive and a pre-built one, and I directly download the `pre-built` archive.
another thing that needs care is the version of hadoop: `hadoop 2.x` has no `--daemon` option, which leads to errors when the master node runs `hadoop 3.x`.
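for illustration, starting a datanode by hand looks like this on each major version; the 3.x `--daemon` flag is exactly what 2.x lacks:

```shell
# hadoop 3.x style
hdfs --daemon start datanode
# hadoop 2.x equivalent (wrapper script, deprecated in 3.x)
hadoop-daemon.sh start datanode
```

so keep all nodes on the same major version.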
```shell
tar xzvf hadoop-3.2.1.tar.gz
sudo cp -r hadoop-3.2.1 /usr/local/hadoop-3.2.1
cd /usr/local
sudo ln -s hadoop-3.2.1 hadoop
```

add `HADOOP_HOME=/usr/local/hadoop` and `PATH=$PATH:$HADOOP_HOME/bin` to `~/.bashrc`. test with `hadoop version`.

the hadoop configuration guide can be found here.
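as a minimal sketch of what that configuration amounts to for a 2-node HDFS cluster: `core-site.xml` and `hdfs-site.xml` on both nodes, plus the `workers` file on the master. the hostnames `master` and `worker` are assumptions here; match them to your `/etc/hosts`:

```shell
cd $HADOOP_HOME/etc/hadoop

# point the default filesystem at the NameNode on the master
cat > core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# only 2 datanodes, so a replication factor of 2 at most
cat > hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

# hadoop 3.x: datanode hosts go in the workers file (master node only)
cat > workers <<'EOF'
master
worker
EOF
```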

there is another issue with `JAVA_HOME` not being found, which I fixed by setting the `JAVA_HOME` variable in `$HADOOP_HOME/etc/hadoop/hadoop-env.sh`.
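the one line that fixes it (same jdk path as earlier):

```shell
# in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk
```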

#### passwordless access among nodes

  • generate SSH key pair

```shell
# on master node:
ssh-keygen -t rsa -b 4096 -C "master"
# on worker node:
ssh-keygen -t rsa -b 4096 -C "worker"
```

the following two steps need to be done on both machines, so that each machine can ssh both to itself and to the remote node.

  • enable SSH access to local machine

    ssh-copy-id hduser@192.168.0.10

  • copy public key to the remote node

    ssh-copy-id hduser@192.168.0.13

tip: changing the default `id_rsa` name to something else didn’t work for me. after the steps above, a `known_hosts` file is generated on the local machine, and an `authorized_keys` file, which holds the client’s public key, appears on the remote machine.
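a quick check that passwordless login actually works (IPs as in the steps above):

```shell
ssh hduser@192.168.0.13 hostname   # should print the worker hostname, with no password prompt
```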

#### test hadoop
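before checking the daemons, HDFS has to be formatted and started from the master; a minimal sketch, assuming the config above:

```shell
# on the master node, first run only: format the namenode
hdfs namenode -format
# start HDFS on master and workers (uses the workers file + passwordless ssh)
$HADOOP_HOME/sbin/start-dfs.sh
```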

  • on master node

```shell
hduser@ubuntu:/usr/local/hadoop/sbin$ jps
128816 SecondaryNameNode
128563 DataNode
129156 Jps
128367 NameNode
```
  • on worker node:

```shell
hduser@worker:/usr/local/hadoop/logs$ jps
985 Jps
831 DataNode
```

and finally, test with a mapreduce job.
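a classic smoke test is the bundled pi estimator; the jar version below matches the hadoop-3.2.1 install assumed earlier:

```shell
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10
```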