background
k8s networking includes a few topics:
pod to pod communication in k8s
pod to the host node communication
pod to external service URL
external request to pod in k8s
the networking has two components: DNS and iptables. DNS is used to resolve URL names to IPs; iptables is used to control network traffic going in and out.
prepare an image for testing
to test the network inside a pod/docker container, we need to add the following network tools:
|
|
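for example, a minimal sketch, assuming a Debian/Ubuntu based image (package names differ on other distributions):

```bash
# add common network debugging tools to the test image (assumed Debian/Ubuntu base)
# iputils-ping: ping, dnsutils: dig/nslookup, iproute2: ip, plus iptables and curl
apt-get update && \
    apt-get install -y iputils-ping dnsutils iproute2 iptables curl
```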
docker/pod runtime privileges
by default, docker doesn't allow running iptables inside a container, and it gives errors:
|
|
which requires adding docker runtime privileges and Linux capabilities.
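for plain docker, a minimal sketch (the image name is a placeholder):

```bash
# grant the NET_ADMIN capability so iptables can run inside the container
docker run -it --cap-add=NET_ADMIN my-net-test-image /bin/bash

# or, more coarsely, run the container fully privileged
docker run -it --privileged my-net-test-image /bin/bash
```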
In a Kubernetes pod the capability names are the same, but everything has to be defined in the pod specification: you add an array of capabilities under the securityContext tag.
|
|
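a minimal sketch of such a pod spec, assuming a placeholder test image and only the NET_ADMIN capability:

```bash
# hypothetical pod spec adding NET_ADMIN under securityContext
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: net-test
spec:
  containers:
  - name: net-test
    image: my-net-test-image    # placeholder image with the network tools above
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]
EOF
```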
k8s pod DNS
DNS for services and pods comes in four policy types:
None
Default, where the pod derives its DNS config from the host node it runs on.
ClusterFirst, where the pod uses the DNS info from kube-dns or coreDNS.
ClusterFirstWithHostNet, as the name suggests.
tip: Default is not the default DNS policy. if dnsPolicy is not explicitly specified, then ClusterFirst is used as the default.
the purpose of pod/service DNS is to resolve URLs to IPs, which is the second step after iptables is understood correctly.
coreDNS
coreDNS is set up during kubeadm init, with serverDNS. To do SNAT, pods/services in k8s need access to the resolv.conf/DNS inside the pod:
|
|
clearly, the pod DNS comes from the cluster, which can be checked by:
|
|
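a hedged way to do that check (redisjq is the test pod used later in this post; kube-dns is the service name kubeadm normally creates for coreDNS):

```bash
# the nameserver inside the pod should be the cluster DNS service IP
kubectl exec redisjq -- cat /etc/resolv.conf

# compare with the cluster DNS service backed by coreDNS
kubectl get svc -n kube-system kube-dns
```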
and that's the reason why resolving URLs fails, as the clusterDNS address is kind of arbitrarily defined.
the DNS failure gives errors like the following inside the pod:
|
|
docker0 DNS
the docker engine can configure its DNS in /etc/docker/daemon.json, via the "dns" section. when running docker in standalone mode, docker0 appears to use the host machine's DNS, but when running in swarm mode, an additional DNS needs to be defined for external access.
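a sketch of such a configuration, using example resolver addresses (note this overwrites any existing daemon.json):

```bash
# add a "dns" section to the docker daemon config, then restart docker
cat <<'EOF' > /etc/docker/daemon.json
{
  "dns": ["8.8.8.8", "114.114.114.114"]
}
EOF
systemctl restart docker
```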
iptables in K8S
you should not modify the rules Docker inserts into your iptables policies. Docker installs two custom iptables chains named DOCKER-USER and DOCKER, and it ensures that incoming packets are always checked by these two chains first.
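these chains can be inspected directly on the host, for example:

```bash
# list the chains docker installs in the filter table
iptables -L DOCKER-USER -n -v
iptables -L DOCKER -n -v

# the FORWARD chain jumps to them before anything else
iptables -L FORWARD -n -v --line-numbers
```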
a simple test shows that the docker0 NIC is a bridge well bound to the host network namespace, able both to send requests out and to respond to external requests; while with the flannel.d NIC, pods can't access external resources.
code review: kube-proxy iptables:
iptables has 5 tables and 5 chains.
the 5 chains:
|
|
the 5 tables:
|
|
for k8s pods/services, we mostly consider the filter and nat tables. k8s adds another 7 chains: KUBE-SERVICES, KUBE-EXTERNAL-SERVICES, KUBE-NODEPORTS, KUBE-POSTROUTING, KUBE-MARK-MASQ, KUBE-MARK-DROP, KUBE-FORWARD.
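the extra chains can be listed on any node, for example:

```bash
# show the kube-proxy chains declared in the nat and filter tables
iptables-save -t nat | grep '^:KUBE-'
iptables-save -t filter | grep '^:KUBE-'

# dump the service-dispatch rules kube-proxy keeps in KUBE-SERVICES
iptables -t nat -L KUBE-SERVICES -n | head -n 20
```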
virtual network flannel
check the running flanneld: /etc/cni/net.d/10-flannel.conflist on the host machine is the same as /etc/kube-flannel/cni-conf.json in the flannel container on the master node. /run/flannel/subnet.env exists both in the flannel container (on the master node) and on the master host machine. it looks like the network configuration (subnet.env) is copied from the container to the host machine, so if there is no flannel container running on some nodes, those nodes won't have the right network configuration.
on the master node, HostPath points to: /run/flannel, /etc/cni/net.d, and kube-flannel-cfg (ConfigMap); while on the worker node (due to the missing gcr.io flannel image), /run/flannel/subnet.env is missing. previously I thought copying this file from the master node to the worker node was the solution, but the file goes missing every time the worker node restarts.
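a quick way to verify this on each node (same paths as above; the FLANNEL_* keys are what flanneld normally writes):

```bash
# on the master node: CNI config and the subnet file written by flanneld
cat /etc/cni/net.d/10-flannel.conflist
cat /run/flannel/subnet.env   # FLANNEL_NETWORK / FLANNEL_SUBNET / FLANNEL_MTU / FLANNEL_IPMASQ

# on the worker node: if no flannel pod runs here, this file is missing
ls -l /run/flannel/subnet.env
```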
once both the kube-proxy and flannel images are copied to the worker node and kubelet is restarted on the worker node, the cluster should show Running status for all these components, including 2 replicas of flannel: one running on the master node and the other on the worker node.
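a sketch of that workflow, assuming docker as the container runtime; the image tags and the worker hostname are assumptions, not values from this cluster:

```bash
# on the master node: export the images the worker is missing and copy them over
docker save -o flannel.tar quay.io/coreos/flannel:v0.12.0      # tag is an assumption
docker save -o kube-proxy.tar k8s.gcr.io/kube-proxy:v1.18.0    # tag is an assumption
scp flannel.tar kube-proxy.tar ubuntu:/tmp/                    # worker hostname assumed

# on the worker node: load the images and restart kubelet
docker load -i /tmp/flannel.tar
docker load -i /tmp/kube-proxy.tar
systemctl restart kubelet

# back on the master: everything in kube-system should become Running
kubectl get pods -n kube-system -o wide
```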
as we are using kubectl to start the cluster, the actual flanneld is /opt/bin/flanneld from the running flannel container, and it maps the NIC to the host machine. another thing: flannel is the core of the default kube-proxy, so the kube-proxy image is also required on both nodes. coreDNS runs two replicas on the master node.
Flannel mechanism
the data flow: 1) the outgoing message goes to the VNC (virtual network card) docker0 on the host machine, which transfers it to the VNC flannel0; this process is P2P. the global etcd service maintains an iptables mapping among nodes, which stores the subnet range of each node. 2) the flanneld service on the source node packages the outgoing message as a UDP packet and delivers it to the target node, based on that mapping. 3) when the target node receives the UDP packet, it unpacks the message, sends it to its flannel0, and then transfers it to its docker0.
1) after flanneld starts, it creates the flannel.1 virtual network card. the purpose of flannel.1 is cross-host networking, including packaging/unpackaging UDP and maintaining the iptables mapping among the nodes.
2) each node also creates the cni0 virtual network card the first time the flannel CNI runs. the purpose of cni0 is the same as docker0: it's a bridge network, used for communication within the same node.
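both NICs can be inspected on a node:

```bash
# the VXLAN device used for cross-node pod traffic
ip -d link show flannel.1

# the bridge used for pod-to-pod traffic on the same node
ip -d link show cni0
ip addr show cni0

# routes: pod subnets of other nodes go out via flannel.1
ip route
```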
test with redis services
we have defined a redisjq pod; the following tests are all run in this pod:
|
|
the output above is the initial output before we had any network settings. basically the pod can only ping localhost, neither the host DNS nor the host IP. the vip (10.4.1.46) is not in the same network namespace as the host network namespace.
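a sketch of how such a connectivity check can be run from outside the pod (IPs taken from the tests below):

```bash
# ping targets from inside the redisjq pod
kubectl exec redisjq -- ping -c 2 127.0.0.1        # localhost: works
kubectl exec redisjq -- ping -c 2 10.20.181.132    # host IP of the ubuntu node
kubectl exec redisjq -- ping -c 2 www.baidu.com    # fails until DNS/iptables are fixed
```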
flannel.d pods on both nodes:
|
|
flannel.d is running on each node, and should be triggered by kubelet. flannel.d is used as a virtual network interface to manage cross-node pod communication inside k8s.
coredns pods on the meng node
|
|
coredns has two replicas, both running on the master node (meng), and we can see they only have virtual/cluster IPs (10.4.0.x).
redisjq pod on the ubuntu node
|
|
by default, the working pod only runs on the worker node (ubuntu), and it has clusterIP 10.4.1.47.
pod1 ping pod2 in the same node
ping succeeds; no doubt that within the same node, pods can ping each other.
redisjq pod1 (10.4.1.47) in ubuntu ping coredns pod2 (10.4.0.27) in meng
|
|
ping succeeds, which is the work of flanneld.
redisjq pod1(10.4.1.7) in ubuntu ping hostIP(10.20.181.132) of ubuntu
|
|
ping succeeds: a pod in the cluster can ping its host node, so that sounds fine.
redisjq pod1(10.4.1.7) in ubuntu ping hostIP(10.20.180.12) of meng
|
|
ping fails. interesting: so a pod in the cluster can't ping any non-hosting node's IP.
so far, a pod with a vip can ping any other pod with a vip in the cluster, whether on the same node or not. a pod with a vip can only ping its own host machine's physical IP; it can't ping other hosts' IPs.
namely, the pod VIP network inside k8s and the bridge network from pod vip to its host are set up well, but the network from pod to external IPs is not.
these 4 tests give a very good understanding of flannel's function inside k8s: pod-to-pod communication, on the same node or not. but usually we also need SNAT or DNAT. to make SNAT/DNAT available, we need to understand the DNS & iptables of k8s.
update iptables to allow pods to access public IPs
cni0, docker0, eno1, flannel.1 on the host machine vs eth0 in the pod
these virtual NICs are common in a k8s environment.
- on node1
|
|
- on pod1, which is running on node1
|
|
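the listings above can be reproduced with commands along these lines:

```bash
# on node1: the host-side virtual NICs
for nic in cni0 docker0 flannel.1 eno1; do ip addr show "$nic"; done

# on pod1 (running on node1): only eth0 and loopback are visible
kubectl exec redisjq -- ip addr show eth0
```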
pod1 -> pod2 network message flow
|
|
to allow SNAT of network messages, namely to FORWARD internal clusterIP traffic to external services, we can add the following new iptables rule:
|
|
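a minimal sketch of such a rule, assuming the pod subnet is 10.4.0.0/16 (as the pod IPs above suggest) and eno1 is the host's outbound NIC:

```bash
# masquerade (SNAT) traffic leaving the pod subnet for the outside world
iptables -t nat -A POSTROUTING -s 10.4.0.0/16 -o eno1 -j MASQUERADE

# and make sure forwarded pod traffic is accepted
iptables -A FORWARD -s 10.4.0.0/16 -j ACCEPT
iptables -A FORWARD -d 10.4.0.0/16 -j ACCEPT
```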
after adding the new rule, check inside the pod:
|
|
the DNS is not fixed yet, so we can't ping www.baidu.com, but we can ping its IP.
on the other hand, to handle FORWARDing external requests to the internal clusterIP, we can add the following new iptables rule:
|
|
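a hedged example, forwarding a host port to the redis pod; the pod IP is the one above, while the host port 30379 and redis port 6379 are assumptions:

```bash
# DNAT: external requests hitting host port 30379 go to the redis pod 10.4.1.47:6379
iptables -t nat -A PREROUTING -p tcp --dport 30379 -j DNAT --to-destination 10.4.1.47:6379
iptables -A FORWARD -p tcp -d 10.4.1.47 --dport 6379 -j ACCEPT
```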
that's the beauty of iptables.
as mentioned previously, to handle the pod DNS error, we need to add a pod/service DNS policy inside the pod.yaml:
|
|
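a minimal sketch of the relevant part of pod.yaml, using the redisjq pod as an example (the image tag is an assumption):

```bash
# pod spec that inherits DNS from the host node via dnsPolicy: Default
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: redisjq
spec:
  dnsPolicy: Default          # use the host node's /etc/resolv.conf
  containers:
  - name: redisjq
    image: redis:5            # tag is an assumption
EOF
kubectl apply -f pod.yaml
```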
our k8s cluster has no DNS server of its own, so to do SNAT/DNAT we have to keep the Default dns policy, which makes the pod/service use its host machine's DNS, as defined in /etc/resolv.conf.
one thing to take care of: some host machines have only nameserver 127.0.0.1 in resolv.conf, in which case we need to add the real DNS server.
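for example (the resolver address is only an example; on hosts managed by systemd-resolved or NetworkManager this file may be regenerated):

```bash
# check what the host currently resolves with
cat /etc/resolv.conf

# append a real upstream nameserver if only 127.0.0.1 is present
echo "nameserver 114.114.114.114" >> /etc/resolv.conf
```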
summary
with knowledge about iptables and dns, we can make a useful k8s cluster. the remaining work is to make useful pods.
reference
Jimmy Song: config K8S DNS: kube-dns