k8s: flannel

background

k8s networks include a few topics:

  • pod to pod communication in k8s

  • pod to the host node communication

  • pod to external service URL

  • external request to pod in k8s

the networks have two components: DNS and iptables. DNS is used to resolve URL names to IPs; iptables is used to control how network traffic flows in and out.

prepare an image for testing

to test the network inside a pod/docker container, we need to add the following network tools to the image:

# install network debugging tools (e.g. in a Dockerfile RUN step)
apt-get update && apt-get install -y \
    iputils-ping \
    net-tools \
    iptables \
    iproute

docker/pod runtime privileges

by default, docker doesn't allow running iptables inside a container, and it gives errors like:

root@redisjq:/# iptables -t nat -L | grep INPUT
iptables v1.6.0: can't initialize iptables table `nat': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.

to fix this, we need to grant the container extra docker runtime privileges, namely Linux capabilities.
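
for plain docker, a minimal sketch is to grant the NET_ADMIN capability when starting the container (the image name below is just a placeholder):

# grant the container permission to manipulate iptables rules
# "my-net-test-image" is a placeholder for an image with iptables installed
docker run -it --cap-add=NET_ADMIN my-net-test-image /bin/bash
# inside the container, "iptables -t nat -L" should now succeed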

container capabilities in k8s

In a Kubernetes pod, the capability names are the same as in docker, but everything has to be defined in the pod specification: you add an array of capabilities under the securityContext tag.

securityContext:
  capabilities:
    add:
    - NET_ADMIN
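
as a minimal sketch, this snippet sits under a container entry of the pod spec; the pod name and image below are placeholders, not from this cluster:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: net-admin-demo        # placeholder pod name
spec:
  containers:
  - name: main
    image: ubuntu:18.04       # placeholder image
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
EOF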

k8s pod DNS

DNS for services and pods supports four policies:

  • None

  • Default, where the pod inherits the DNS configuration from the node it runs on.

  • ClusterFirst, where the pod uses DNS info from kube-dns or coreDNS.

  • ClusterFirstWithHostNet, as the name suggests, for pods running with hostNetwork.

tip: Default is not the default DNS policy. if dnsPolicy is not explicitly specified, ClusterFirst is used.
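
to check which policy a pod actually ended up with, something like the following works (redisjq and lg are the pod and namespace used later in this post):

kubectl get pod redisjq -n lg -o jsonpath='{.spec.dnsPolicy}'
# prints ClusterFirst when no dnsPolicy was set explicitly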

the purpose of pod/service DNS is to translate a URL into an IP; it is the second piece to understand, after iptables.

coreDNS is set up during kubeadm init with the cluster's service DNS. to reach external services (SNAT), a pod/service in k8s first needs working name resolution.

resolv.conf/DNS inside the pod:
root@redisjq:/redis# cat /etc/resolv.conf
nameserver 10.96.0.10
search lg.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

clearly, the pod's DNS is coming from the cluster, which can be checked by:

kubectl describe configmap kubeadm-config -n kube-system
kind: ClusterConfiguration
kubernetesVersion: v1.18.2
networking:
  dnsDomain: cluster.local
  podSubnet: 10.4.0.0/16
  serviceSubnet: 10.96.0.0/12
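
the nameserver 10.96.0.10 inside the pod is simply the clusterIP of the kube-dns service (served by coreDNS), which can be confirmed with:

kubectl get svc kube-dns -n kube-system
# expect CLUSTER-IP 10.96.0.10, matching the nameserver in the pod's resolv.conf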

and that’s the reason why resolve URL failed, as clusterDNS is kind of random defined.

the DNS failure shows up inside the pod as errors like:

socket.gaierror: [Errno -3] Temporary failure in name resolution
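
the error above is what a Python client sees; assuming python3 is available in the image, it can be reproduced inside the pod with a one-liner (the hostname is just an example):

python3 -c "import socket; print(socket.gethostbyname('www.baidu.com'))"
# raises socket.gaierror: [Errno -3] Temporary failure in name resolution
# as long as the pod's DNS is broken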

docker0 DNS

docker engine can configure its DNS in /etc/docker/daemon.json with the "dns" key. when running docker standalone, docker0 appears to use the host machine's DNS; when running in swarm mode, additional DNS servers need to be defined for external access.
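
a minimal sketch of such a configuration; the nameserver addresses are examples, not values from this cluster:

# write /etc/docker/daemon.json with explicit upstream DNS servers
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "dns": ["8.8.8.8", "1.1.1.1"]
}
EOF
sudo systemctl restart docker   # restart the daemon to pick up the change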

iptables in K8S

docker and iptables

you should not modify the rules Docker inserts into your iptables policies. Docker installs two custom iptables chains named DOCKER-USER and DOCKER, and it ensures that incoming packets are always checked by these two chains first

a simple test shows that the docker0 bridge binds well to the host network namespace, so containers on it can both reach out and answer external requests; with the flannel.1 NIC, however, pods can't access external resources.
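
a rough way to reproduce that observation (busybox is just a convenient test image, and the iptables fix comes later in this post):

# via docker0: a plain docker container can reach an external IP
docker run --rm busybox ping -c 1 8.8.8.8
# via cni0/flannel: the same ping from inside the redisjq pod fails
# until the SNAT rule described below is added
kubectl exec -it redisjq -n lg -- ping -c 1 8.8.8.8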

code review: kube-proxy iptables:

iptables has 5 tables and 5 chains.

the 5 chains:

PREROUTING: before the routing decision; DNAT of incoming external requests happens here.
INPUT: packets destined for the local host or the current network namespace.
FORWARD: packets forwarded to another host or another network namespace.
OUTPUT: packets leaving the current host.
POSTROUTING: after routing, just before the NIC; SNAT happens here.

the 5 tables:

filter table: decides whether a packet is ACCEPTed, DROPped, or REJECTed when it hits a chain.
nat (network address translation) table: modifies the source/destination address of a packet.
mangle table: modifies IP header fields of a packet.
raw table
security table

for k8s pods/services we mostly care about the filter and nat tables. on top of these, k8s adds another 7 chains: KUBE-SERVICES, KUBE-EXTERNAL-SERVICES, KUBE-NODEPORTS, KUBE-POSTROUTING, KUBE-MARK-MASQ, KUBE-MARK-DROP, KUBE-FORWARD.
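
these chains can be inspected directly on a node, for example:

# kube-proxy's service chains live in the nat table (run as root on a node)
iptables -t nat -L KUBE-SERVICES -n | head
# the chain that masquerades (SNAT) marked packets
iptables -t nat -L KUBE-POSTROUTING -n
# the filter-table chain that accepts forwarded pod traffic
iptables -L KUBE-FORWARD -n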

(figure: kube-proxy iptables chains)

virtual network flannel

check running flanneld

  • /etc/cni/net.d/10-flannel.conflist on the host machine is the same as /etc/kube-flannel/cni-conf.json inside the flannel container on the master node.

  • /run/flannel/subnet.env exists both in the flannel container (on the master node) and on the master host machine. it looks like this network configuration (subnet.env) is copied from the container to the host machine, so nodes without a running flannel container won't get the right network configuration (see the quick check below).
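
a quick check of both files on a node; the contents vary per cluster, only the paths are fixed:

# CNI configuration that the host's CNI plugins read
cat /etc/cni/net.d/10-flannel.conflist
# per-node subnet lease written by flanneld; missing if flannel isn't running here
cat /run/flannel/subnet.env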

on the master node, the HostPath mounts point to /run/flannel, /etc/cni/net.d and the kube-flannel-cfg ConfigMap; on the worker node (due to the missing gcr.io flannel image), /run/flannel/subnet.env is missing. I previously thought copying this file from the master node to the worker node was the solution, but the file disappears every time the worker node restarts.

once both the kube-proxy and flannel images are copied to the worker node and kubelet is restarted there, the cluster should show Running status for all these components, including 2 copies of flannel: one on the master node and the other on the worker node.

as we use kubectl to start the cluster components, the actual flanneld is /opt/bin/flanneld inside the running flannel container, and it maps its NIC onto the host machine.

another thing: flannel works hand in hand with the default kube-proxy, so the kube-proxy image is also required on both nodes. coreDNS runs two replicas on the master node.

Flannel mechanism

the data flow: 1) a message sent from a container first goes to the virtual bridge docker0 on the host machine, which hands it over to the virtual NIC flannel0; this step is point-to-point. the global etcd service maintains a routing table across the nodes, storing the subnet range of each node. 2) the flanneld service on the sending node wraps the message into a UDP packet and delivers it to the target node, based on that routing table. 3) when the target node receives the UDP packet, it unwraps the message, passes it to its own flannel0, and from there to its docker0.

1) after flanneld starts, it creates the flannel.1 virtual NIC. the purpose of flannel.1 is cross-host networking: wrapping/unwrapping the UDP packets and maintaining the routes among the nodes.

2) each node also creates a cni0 virtual NIC the first time the flannel CNI runs. the purpose of cni0 is the same as docker0: it is a bridge network used for communication between pods on the same node.
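
both devices can be inspected on a node; a small sketch, where 10.4. is this cluster's pod subnet prefix:

# flannel.1 is a vxlan device used for cross-node pod traffic
ip -d link show flannel.1
# cni0 is a plain linux bridge for pods on the local node
ip -d link show cni0
# remote pod subnets are routed via flannel.1, the local one via cni0
ip route | grep 10.4.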

(figure: flannel cross-node data flow)

test with redis services

we defined a redisjq pod earlier; the following tests all run inside this pod:

kubectl exec -it redisjq -n lg /bin/bash
ping localhost #ok
ping 10.255.18.3 #not
ping 10.3.101.101 #not
ping 10.20.180.12
ifconfig
>>eth0, 10.4.1.46
>>lo, 127.0.0.1

the output above is the initial state, before any network settings were changed. basically the pod can only ping localhost: neither the host's DNS server nor the host IP is reachable. the pod IP (10.4.1.46) is not in the same network namespace as the host network.
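
to see that the pod's eth0 only knows its node-local bridge at this point, compare the pod's default route with the node's cni0 address (a sketch, commands only):

# inside the pod: the default route should point at the node's cni0 address
ip route
# on the hosting node: cni0 owns that gateway IP (10.4.1.1 on the ubuntu node)
ip addr show cni0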

the flanneld pods on both nodes:

david@meng:~/k8s/lgsvl$ kubectl get pods kube-flannel-ds-amd64-85d6m -n kube-system --output=wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-flannel-ds-amd64-85d6m 1/1 Running 5 15d 10.20.180.12 meng <none> <none>
david@meng:~/k8s/lgsvl$ kubectl get pods kube-flannel-ds-amd64-fflsl -n kube-system --output=wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-flannel-ds-amd64-fflsl 1/1 Running 154 15d 10.20.181.132 ubuntu <none> <none>

flanneld runs on each node, started by kubelet as a DaemonSet pod. it acts as the virtual network interface that manages cross-node pod communication inside k8s.

coredns pods on the meng node

david@meng:~/k8s/lgsvl$ kubectl get pods coredns-66bff467f8-59g97 -n kube-system --output=wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-66bff467f8-59g97 1/1 Running 4 14d 10.4.0.27 meng <none> <none>

coredns has two replicas, both running on the master node (meng), and we can see they only have virtual IPs from the pod subnet (10.4.0.x).

redisjq pod on the ubuntu node

david@meng:~/k8s/lgsvl$ kubectl get pods redisjq -n lg --output=wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redisjq 1/1 Running 0 20m 10.4.1.47 ubuntu <none> <none>

by default, the workload pod only runs on the worker node (ubuntu), and it gets the pod IP 10.4.1.47.

pod1 ping pod2 in the same node

the ping succeeds; no surprise, pods on the same node can ping each other.

redisjq pod1 (10.4.1.47) on ubuntu pings coredns pod2 (10.4.0.27) on meng

root@redisjq:/redis# ping 10.4.0.27
PING 10.4.0.27 (10.4.0.27) 56(84) bytes of data.
64 bytes from 10.4.0.27: icmp_seq=1 ttl=62 time=0.757 ms

the ping succeeds; this is flanneld doing its job.

redisjq pod1 (10.4.1.47) on ubuntu pings the host IP (10.20.181.132) of ubuntu

root@redisjq:/redis# ping 10.20.181.132
PING 10.20.181.132 (10.20.181.132) 56(84) bytes of data.
64 bytes from 10.20.181.132: icmp_seq=1 ttl=64 time=0.127 ms

the ping succeeds: a pod in the cluster can ping its own host node, no problem there.

redisjq pod1 (10.4.1.47) on ubuntu pings the host IP (10.20.180.12) of meng

root@redisjq:/redis# ping 10.20.180.12
PING 10.20.180.12 (10.20.180.12) 56(84) bytes of data.

the ping fails. interesting: a pod in the cluster can't ping the IP of any node other than its own host.

so far, a pod with a pod IP can ping any other pod IP in the cluster, whether on the same node or not. a pod can only ping its own host machine's physical IP, not the IPs of the other hosts.

namely, the pod network inside k8s and the bridge from a pod to its own host are set up well, but the path from a pod to external IPs is not.

these 4 tests give a good picture of flannel's job inside k8s: pod-to-pod traffic, on the same node or across nodes. but usually we also need SNAT or DNAT, and to make SNAT/DNAT work we need to understand k8s DNS and iptables.

update iptables to allow pods to access public IPs

cni0, docker0, eno1, flannel.1 in host machine vs eth0 in pod

these virtual NICs are common in a k8s environment.

  • on node1
cni0: 10.4.1.1
docker0: 172.17.0.1
eno1: 10.20.181.132
flannel.1: 10.4.1.0
  • on pod1, which is running on node1
eth0: 10.4.1.48

pod1 -> pod2 network message flow

pod1(10.4.1.48) on node1(10.20.181.132) -> cni0(10.4.1.1) -> flannel.1(10.4.1.0) -> kube-flannel on node1(10.20.181.132) -> kube-flannel on node2(10.20.180.12) -> flannel.1 on node2 -> cni0 on node2 -> pod2(10.4.1.46) on node2
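
assuming tcpdump is installed on the nodes, both hops can be watched while pod1 pings pod2; with flannel's vxlan backend the cross-node traffic typically shows up on the physical NIC as UDP port 8472:

# on node1: the inner ICMP packets appear on the flannel.1 device
tcpdump -ni flannel.1 icmp
# on node1: the same traffic leaves the physical NIC wrapped in VXLAN/UDP
tcpdump -ni eno1 udp port 8472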

to allow SNAT of outgoing traffic, i.e. to forward data from the internal pod subnet to external services, we can add the following new iptables rule:

iptables -t nat -I POSTROUTING -s 10.4.1.0/24 -j MASQUERADE
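
to confirm the rule landed at the top of the chain, list the nat POSTROUTING chain on the node:

# run on the node that hosts the 10.4.1.0/24 pod subnet
iptables -t nat -L POSTROUTING -n --line-numbers | head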

after adding the new rule, check inside the pod:

root@redisjq:/redis# ping 10.20.180.12
PING 10.20.180.12 (10.20.180.12) 56(84) bytes of data.
64 bytes from 10.20.180.12: icmp_seq=1 ttl=62 time=0.690 ms
root@redisjq:/redis# ping 10.20.181.132
PING 10.20.181.132 (10.20.181.132) 56(84) bytes of data.
64 bytes from 10.20.181.132: icmp_seq=1 ttl=64 time=0.108 ms
root@redisjq:/redis# ping 10.20.180.61
PING 10.20.180.61 (10.20.180.61) 56(84) bytes of data.
64 bytes from 10.20.180.61: icmp_seq=1 ttl=126 time=0.366 ms
root@redisjq:/redis# ping www.baidu.com
ping: unknown host www.baidu.com
root@redisjq:/redis# ping 61.135.169.121 #baidu IP
PING 61.135.169.121 (61.135.169.121) 56(84) bytes of data.
64 bytes from 61.135.169.121: icmp_seq=1 ttl=51 time=8.16 ms

DNS is not fixed yet, so we can't ping www.baidu.com by name, but we can ping its IP.

on the other hand, to forward external requests to the internal pod subnet, we can add the following new iptables rule:

iptables -t nat -I PREROUTING -d 10.4.1.0/24 -j MASQUERADE

that’s beauty of iptables.

as mentioned previously, to fix the pod DNS error, we need to add a pod/service DNS policy in the pod.yaml:

spec:
  dnsPolicy: Default

our k8s cluster has no DNS server of its own, so to do SNAT/DNAT we have to keep the Default dns policy, which makes the pod/service use its host machine's DNS, as defined in /etc/resolv.conf.

one thing to take care of: some host machines only have nameserver 127.0.0.1 in resolv.conf; in that case we need to add the real DNS server.
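
if the host runs systemd-resolved (common on Ubuntu), the real upstream servers can be found as follows; this is a sketch, other hosts may manage resolv.conf differently:

# show the upstream DNS servers known to systemd-resolved
resolvectl status
# the non-stub resolv.conf maintained by systemd-resolved
cat /run/systemd/resolve/resolv.conf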

summary

with this knowledge of iptables and DNS, we can build a useful k8s cluster. the remaining work is to build useful pods.
