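Symptom
After a power outage in the server room, some Nodes in the K8S cluster stayed in the NotReady state once power was restored.
Check the kubelet logs
On the Node, check the kubelet logs: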
$ journalctl -u kubelet
Sep 01 09:24:20 node30 kubelet[1399]: I0901 09:24:20.953854 1399 csi_plugin.go:945] Failed to contact API server when waiting for CSINode publishing: Unauthorized
Sep 01 09:24:20 node30 kubelet[1399]: E0901 09:24:20.960125 1399 controller.go:136] failed to ensure node lease exists, will retry in 3.2s, error: Unauthorized
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.051942 1399 kubelet.go:2267] node "node30" not found
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.152071 1399 kubelet.go:2267] node "node30" not found
Sep 01 09:24:21 node30 kubelet[1399]: I0901 09:24:21.152502 1399 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
Sep 01 09:24:21 node30 kubelet[1399]: I0901 09:24:21.152782 1399 setters.go:77] Using node IP: "210.45.193.149"
Sep 01 09:24:21 node30 kubelet[1399]: I0901 09:24:21.177629 1399 kubelet_node_status.go:486] Recording NodeHasSufficientMemory event message for node node30
Sep 01 09:24:21 node30 kubelet[1399]: I0901 09:24:21.177658 1399 kubelet_node_status.go:486] Recording NodeHasNoDiskPressure event message for node node30
Sep 01 09:24:21 node30 kubelet[1399]: I0901 09:24:21.177666 1399 kubelet_node_status.go:486] Recording NodeHasSufficientPID event message for node node30
Sep 01 09:24:21 node30 kubelet[1399]: I0901 09:24:21.177686 1399 kubelet_node_status.go:70] Attempting to register node node30
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.179746 1399 kubelet_node_status.go:92] Unable to register node "node30" with API server: Unauthorized
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.184238 1399 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.CSIDriver: Unauthorized
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.252222 1399 kubelet.go:2267] node "node30" not found
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.340245 1399 reflector.go:178] k8s.io/kubernetes/pkg/kubelet/kubelet.go:526: Failed to list *v1.Node: Unauthorized
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.352382 1399 kubelet.go:2267] node "node30" not found
Note the key log line:
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.179746 1399 kubelet_node_status.go:92] Unable to register node "node30" with API server: Unauthorized
This means the Node cannot register with the API server. My initial judgment was that this is why the Node stays in the NotReady state.
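When the journal is long, it can help to narrow it down to error-level entries that end in Unauthorized. The following is a minimal sketch, assuming the klog-style prefix seen in the log above (a severity letter E followed by the date). The function name unauthorized_errors is made up for illustration.

```shell
# Keep only kubelet log lines whose klog severity is E (error) and whose
# message ends with "Unauthorized". The regular expression is an assumption
# derived from the log format shown above.
unauthorized_errors() {
  grep -E 'kubelet\[[0-9]+\]: E[0-9]{4} .*Unauthorized$'
}

# Usage on the Node (not executed here):
#   journalctl -u kubelet --no-pager | unauthorized_errors
```

Informational lines (severity I) that also mention Unauthorized are dropped, which leaves the failed registration and the failed list/lease calls.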
Restart kubelet
$ systemctl restart kubelet
The problem persisted.
Search for information
I searched online but found no report of a similar problem and no solution.
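Investigate why communication with the API server is rejected
Sep 01 09:24:21 node30 kubelet[1399]: E0901 09:24:21.179746 1399 kubelet_node_status.go:92] Unable to register node "node30" with API server: Unauthorized
According to the log, the kubelet's request to the API server was rejected, and the reason given is Unauthorized.
So where is the kubelet's configuration for communicating with the API server?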
$ ps -ef | grep kubelet
/usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=...
--hostname-override=node30
--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--config=/etc/kubernetes/kubelet-config.yaml
--kubeconfig=/etc/kubernetes/kubelet.conf
--pod-infra-container-image=harbor.supwisdom.com/gcr-image/pause:3.2
--runtime-cgroups=/systemd/system.slice
--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
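To pick a single flag out of a long command line like the one above, a small filter may be easier than reading it by eye. This is only a sketch: flag_value is a hypothetical helper, and it assumes flags are written as --name=value with no spaces inside the value.

```shell
# Print the value of --<name>=<value> from a command line read on stdin.
# The match is anchored at the start of each word, so asking for
# "kubeconfig" does not match "--bootstrap-kubeconfig".
flag_value() {
  tr ' ' '\n' | sed -n "s/^--$1=//p"
}

# Usage on the Node (not executed here; ps -C assumes procps on Linux):
#   ps -o args= -C kubelet | flag_value kubeconfig
```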
Note the flag --kubeconfig=/etc/kubernetes/kubelet.conf. The contents of this file are:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [Base64]
    server: https://localhost:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
Comparing the certificate-authority-data field with that of a healthy Node shows that the two are identical.
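The comparison of the certificate-authority-data field can be scripted rather than done by eye. The sketch below assumes a copy of the healthy Node's kubelet.conf is available locally; the helper names ca_data and same_ca and the path ./healthy-kubelet.conf are hypothetical.

```shell
# Print the certificate-authority-data value from a kubeconfig file.
ca_data() {
  awk '$1 == "certificate-authority-data:" { print $2 }' "$1"
}

# Succeed when two kubeconfig files embed the same, non-empty CA data.
same_ca() {
  a=$(ca_data "$1")
  b=$(ca_data "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (healthy-kubelet.conf is a hypothetical copy from a healthy Node):
#   same_ca /etc/kubernetes/kubelet.conf ./healthy-kubelet.conf && echo "CA matches"
```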
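Comparing /var/lib/kubelet/pki/kubelet-client-current.pem with that of a healthy Node shows that the two differ, so the problem may lie in this file. Presumably each Node, as an independent client of the API server, has its own unique certificate.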
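Inspect kubelet-client-current.pem
This file is an x509 certificate, evidently the one the API server uses for client authentication, since the kubelet is also a client of the API server. Let's take a look at it.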
$ ls -l /var/lib/kubelet/pki/
-rw-------. 1 root root 1090 Nov 6 2020 kubelet-client-2020-11-06-02-48-49.pem
-rw-------. 1 root root 1090 Sep 1 2021 kubelet-client-2021-09-01-13-21-24.pem
lrwxrwxrwx. 1 root root 59 Sep 1 2021 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2021-09-01-13-21-24.pem
-rw-r--r--. 1 root root 2279 Nov 6 2020 kubelet.crt
-rw-------. 1 root root 1679 Nov 6 2020 kubelet.key
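The file is a symbolic link pointing to kubelet-client-2021-09-01-13-21-24.pem. However, I was troubleshooting at around 11:00 on the morning of September 1, so the timestamp in this file name (13:21:24) is clearly ahead of the actual time.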
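The observation that the file name is ahead of the clock can be checked mechanically. This is a rough sketch under two assumptions: GNU date is available, and the timestamp in the file name is in the Node's local time. The helper names cert_name_epoch and name_is_in_future are hypothetical.

```shell
# Convert the timestamp in kubelet-client-YYYY-MM-DD-HH-MM-SS.pem to epoch
# seconds. Assumes GNU date and that the name encodes local time.
cert_name_epoch() {
  stamp=$(basename "$1" .pem | sed -E 's/^kubelet-client-([0-9]{4}-[0-9]{2}-[0-9]{2})-([0-9]{2})-([0-9]{2})-([0-9]{2})$/\1 \2:\3:\4/')
  date -d "$stamp" +%s
}

# Succeed when the timestamp in the name is later than "now".
# $2 is "now" in epoch seconds and defaults to the current time.
name_is_in_future() {
  now=${2:-$(date +%s)}
  [ "$(cert_name_epoch "$1")" -gt "$now" ]
}

# Usage on the Node (not executed here):
#   name_is_in_future "$(readlink -f /var/lib/kubelet/pki/kubelet-client-current.pem)" \
#     && echo "current certificate file is named with a future timestamp"
```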
Next, look at the contents of this file:
$ openssl x509 -in kubelet-client-2021-09-01-13-21-24.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            [hex block]
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kubernetes
        Validity
            Not Before: Sep 1 05:11:39 2021 GMT
            Not After : Sep 1 05:11:39 2022 GMT
        Subject: O = system:nodes, CN = system:node:node30.eams.supwisdom.com
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    [hex block]
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
    Signature Algorithm: sha256WithRSAEncryption
        [hex block]
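The certificate is valid from 13:11:39 on September 1, 2021 to 13:11:39 on September 1, 2022 (GMT converted to UTC+8). This is wrong: at the time of troubleshooting, around 11:00, the validity period had not yet started, which is consistent with the API server rejecting the kubelet as Unauthorized.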
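The "not valid yet" condition can also be tested directly against the Not Before date. Note that openssl x509 -checkend only looks at expiry, so the sketch below compares the start date itself. It assumes GNU date, which can parse the date text printed by openssl; the helper name not_yet_valid is hypothetical.

```shell
# Succeed when a certificate's notBefore date is later than "now".
# $1: date text as printed by openssl, e.g. "Sep  1 05:11:39 2021 GMT"
# $2: "now" in epoch seconds (defaults to the current time)
# Assumption: GNU date is available to parse the openssl date format.
not_yet_valid() {
  start=$(date -d "$1" +%s) || return 2
  now=${2:-$(date +%s)}
  [ "$start" -gt "$now" ]
}

# Usage on the Node (not executed here):
#   start=$(openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -startdate | cut -d= -f2)
#   not_yet_valid "$start" && echo "certificate is not valid yet"
```

With the values from this incident, 11:00 in UTC+8 is 03:00 GMT, which is earlier than the Not Before time of 05:11:39 GMT, so the check would succeed.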
Solution
Because the same directory also contains kubelet-client-2020-11-06-02-48-49.pem, I tried pointing the symbolic link at that file and then restarting kubelet.
$ rm /var/lib/kubelet/pki/kubelet-client-current.pem
$ ln -s /var/lib/kubelet/pki/kubelet-client-2020-11-06-02-48-49.pem /var/lib/kubelet/pki/kubelet-client-current.pem
$ systemctl restart kubelet
The problem was solved, and the Node returned to the Ready state.
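The relinking step can also be done in one command, so the link is never left missing between rm and ln. This is a sketch of a variant, not the commands actually run during the incident; repoint_link is a hypothetical helper. It only helps if the older certificate is still within its validity period, which is worth checking first.

```shell
# Point link $2 at target $1, replacing an existing symlink in one step.
# Refuses to proceed if the target file does not exist.
repoint_link() {
  [ -f "$1" ] || { echo "missing target: $1" >&2; return 1; }
  ln -sfn "$1" "$2"
}

# Usage on the Node (not executed here):
#   repoint_link /var/lib/kubelet/pki/kubelet-client-2020-11-06-02-48-49.pem \
#                /var/lib/kubelet/pki/kubelet-client-current.pem
#   systemctl restart kubelet
```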
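Follow-up
Because the Node still had containers left over from before the power outage, I cleaned it up by stopping kubelet, deleting all containers, and then starting kubelet again.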