Troubleshooting Longhorn CSI CrashLoopBackOff
TL;DR
If you see the following error repeating in a longhorn-csi-plugin pod stuck in CrashLoopBackOff, try disabling SELinux and restarting the pod. If the pod is then able to connect to csi.sock, you have found your problem.
Still connecting to unix:///csi/csi.sock
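In practice, that quick test looks something like this (assuming the default longhorn-system namespace and Longhorn's stock app=longhorn-csi-plugin pod label):
sudo setenforce 0
kubectl -n longhorn-system delete pod -l app=longhorn-csi-plugin
kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -w
The permanent fix, an SELinux policy module, is at the bottom of this post.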
The Problem
I recently deployed Longhorn to my Kubernetes lab cluster. This was a departure from my previous build, which used Ceph, so as expected there was bound to be an issue or two. Overall the deployment went well; however, I have been experiencing issues with the longhorn-csi-plugin pods entering a CrashLoopBackOff state. Viewing the logs shows the following:
longhorn-csi-plugin time="2023-05-24T00:22:02Z" level=info msg="Enabling volume access mode: MULTI_NODE_MULTI_WRITER"
longhorn-csi-plugin time="2023-05-24T00:22:02Z" level=info msg="Listening for connections on address: &net.UnixAddr{Name:\"//csi/csi.sock\", Net:\"unix\"}"
longhorn-csi-plugin time="2023-05-24T00:22:02Z" level=info msg="GetPluginInfo: req: {}"
longhorn-csi-plugin time="2023-05-24T00:22:02Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.4.1\"}"
longhorn-csi-plugin time="2023-05-24T00:22:03Z" level=info msg="NodeGetInfo: req: {}"
longhorn-csi-plugin time="2023-05-24T00:22:03Z" level=info msg="NodeGetInfo: rsp: {\"node_id\":\"worker-04\"}"
Stream closed EOF for longhorn-system/longhorn-csi-plugin-6kvc6 (longhorn-csi-plugin)
longhorn-liveness-probe W0524 00:22:21.947475 1290307 connection.go:173] Still connecting to unix:///csi/csi.sock
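For reference, output like the above can be pulled with something like the following (the pod name is from my cluster and will differ on yours; --prefix labels each line with its container, similar to how my log viewer did):
kubectl -n longhorn-system logs longhorn-csi-plugin-6kvc6 --all-containers --prefix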
The Process
Some Google research led me down the path of this being a networking issue. Exactly how it would be a networking issue I was unsure, but I happened across a fairly well-written blog post with the same error, and there are a decent number of GitHub issues also describing it as networking-related. As such, I went down the rabbit hole and attempted to find what might be causing the issue in my setup.
To start, my cluster is five bare-metal servers running Rocky Linux for the OS, and RKE2 with SELinux enabled (foreshadowing) for Kubernetes. Taking this into account, it seemed possible I had run into a networking issue: all nodes have statically assigned IPs in the same subnet MetalLB was serving its address pools from, and only some nodes had failing CSI pods.
However, after some double- and triple-checking there was no overlap: all nodes were correctly assigned their IPs and gateways, FirewallD is disabled, and so on. At this point I was out of ideas for what could be causing the failure. The fact that only some nodes were impacted at least implied the issue was networking-related, but I could not find a reason why that would be.
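For anyone retracing these steps, the checking amounted to roughly the following on each node (plus kubectl from a workstation):
ip -4 addr show                       # confirm the statically assigned IP
ip route show default                 # confirm the gateway
sudo systemctl is-enabled firewalld   # confirm FirewallD is not coming back at boot
kubectl get nodes -o wide             # confirm the node IPs Kubernetes sees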
As I tend to do at 3 AM on a work night, I told myself screw it and started whacking Canal and CoreDNS (nothing fancier than deleting pods; see the commands below). I assumed it was possible that DNS entries somewhere in the CNI stack were not updating correctly, and that by whacking the pods I might end up with a working cluster in the morning. Of course, I did not, so now I was fully unconvinced this was a networking problem and had to ask myself: if it is not the network, what else is different about these nodes?
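For completeness, the whacking itself was just deleting the pods and letting their DaemonSets/Deployments recreate them; the labels below assume RKE2's stock Canal and CoreDNS charts:
kubectl -n kube-system delete pods -l k8s-app=canal
kubectl -n kube-system delete pods -l k8s-app=kube-dns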
To be honest, up until this point I had forgotten I was running RKE2 with SELinux enabled; it basically just works (TM). And that's where I found the error:
sudo cat /var/log/audit/audit.log | grep -m1 denied
type=AVC msg=audit(1684812509.994:5159): avc: denied { connectto } for pid=271290 comm="livenessprobe" path="/csi/csi.sock" scontext=system_u:system_r:container_t:s0:c424,c768 tcontext=system_u:system_r:unconfined_service_t:s0 tclass=unix_stream_socket permissive=0
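If the raw AVC record reads like tea leaves, audit2why (shipped in policycoreutils-python-utils on Rocky) will translate denials into a human-readable explanation:
sudo grep denied /var/log/audit/audit.log | audit2why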
Of course it was SELinux. As a quick test, I temporarily disabled SELinux with setenforce 0, restarted the CSI pods, and there were no more crashes.
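Keep in mind setenforce 0 puts the whole host into permissive mode until the next reboot, so flip it back once the diagnosis is confirmed:
sudo setenforce 1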
The Docs
I was unable to find any documentation on this issue or recommendations from Longhorn on their site. All in all, this may not be an issue/bug per se, but the lack of documentation for deployments on SELinux-enabled OSes is a problem. Given that Rancher also produces RKE2, their next-generation Kubernetes platform, it is safe to assume more and more users will deploy on SELinux-enabled systems.
The Fix
You will need to create the following type enforcement (.te) file and install it; the commands to build and install the module are under "For the truly lazy among us" below.
module longhorn-csi 1.0;
require {
type unconfined_service_t;
type container_t;
class unix_stream_socket connectto;
}
#============= container_t ==============
allow container_t unconfined_service_t:unix_stream_socket connectto;
After the module has been installed, you should be good to go running Longhorn on an SELinux-enabled system.
Note: do keep in mind I am not an SELinux master; this is taken from the output of audit2allow.
I am sure something more restrictive could be made, but that's a job best left for someone competent.
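For reference, audit2allow can generate and compile an equivalent module straight from the audit log, which is essentially where the file above came from:
sudo grep connectto /var/log/audit/audit.log | audit2allow -M longhorn-csi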
For the truly lazy among us
cat <<EOF > longhorn-csi.te
module longhorn-csi 1.0;
require {
type unconfined_service_t;
type container_t;
class unix_stream_socket connectto;
}
#============= container_t ==============
allow container_t unconfined_service_t:unix_stream_socket connectto;
EOF
checkmodule -M -m -o longhorn-csi.mod ./longhorn-csi.te
semodule_package -o longhorn-csi.pp -m ./longhorn-csi.mod
semodule -i ./longhorn-csi.pp
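To verify the module actually loaded, and to kick the pods over so they pick up the new policy (the label is again Longhorn's stock one):
semodule -l | grep longhorn-csi
kubectl -n longhorn-system delete pod -l app=longhorn-csi-plugin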
Sources
Post blog notes
After writing the draft for this post, I came across the exact GitHub issue that would have helped me earlier on in the process; my Google-fu failed me. Below is the issue; hopefully the documentation will be updated as a result of it.