Current setup

The current Longhorn setup consists of four nodes, each with the following storage: four 8 TB HDDs and one 1 TB NVMe SSD in a RAIDZ array, with the SSD acting as a cache. With this design each node's array provides 24 TB of available storage; however, after creating a zvol formatted with ext4 the maximum usable size is 14 TB. On top of that, Longhorn reserves 20% by default to prevent DiskPressure from causing provisioning failures. After all is said and done we go from 32 TB of raw storage to 10.6 TB usable per node, a loss of roughly 67% of my raw storage capacity.
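
For anyone checking the maths on that last figure, the loss works out like so (32 TB raw per node versus the 10.6 TB Longhorn can actually schedule):

❯ awk 'BEGIN { printf "%.1f%%\n", (32 - 10.6) / 32 * 100 }'
66.9%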

As much as I love ZFS, this is not worth the loss of storage. It could be partially remedied by reducing the Storage Minimum Available Percentage setting, however that would only free up (at most) 4.5 TB per node.
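
For reference, that reservation is exposed as a regular Longhorn setting, so the current value can be read (or changed) from the CLI. The setting name below is taken from recent Longhorn releases; double-check it against `kubectl -n longhorn-system get settings.longhorn.io` on your cluster:

❯ kubectl -n longhorn-system get settings.longhorn.io storage-minimum-available-percentage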

The OOP

  1. Evict 2 of 4 nodes
  2. Destroy zpools
  3. Reformat drives with ext4
  4. Re-add disks to Longhorn
  5. Label disks
  6. Deploy/edit existing StorageClasses
  7. Relabel all volumes
  8. Evict other half (and repeat)

Evicting nodes

To begin with I selected the two nodes holding the least amount of data and began the eviction process. This is simple: log in to the Web UI, find the node you would like to evict, select Edit, and finally select ‘Disable’ for ‘Node Scheduling’ and ‘True’ for ‘Eviction Requested’. This will begin evicting the node's data, which may take some time depending on how much data is on the node and the speed of the other nodes.
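
The same thing can be done without the UI by patching the Longhorn node objects directly. A minimal sketch, assuming the two chosen nodes are worker-01 and worker-02 and that your Longhorn version exposes the allowScheduling and evictionRequested fields on nodes.longhorn.io (worth verifying before running):

for node in worker-01 worker-02; do
  kubectl -n longhorn-system patch nodes.longhorn.io $node --type=merge \
    -p '{"spec":{"allowScheduling":false,"evictionRequested":true}}'
done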

Destroying old ZFS pools

Before destroying your ZFS pools, ensure ALL replicas have been removed.
To start, any zvol mounts will need to be removed from /etc/fstab. The node's status will stay disabled in Longhorn after a reboot, so once the zvol has been removed from fstab simply reboot the node. Post reboot you can delete your zpool; in my case this is rke2 and can be done like so:

❯ sudo zpool destroy rke2
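
Once the pool is destroyed, a quick sanity check that nothing is left behind:

❯ zpool list
no pools available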

Redeploying disks

The pools of disks can now be formatted with ext4. I have a particular way of mounting and referencing these disks, but it is by no means a requirement. First I will reformat all the disks with ext4; you can simply overwrite the existing filesystems, for example:

Warning: The below code snippet is destructive; do not run it without understanding exactly which disks it will wipe.

for disk in sda sdb sdc sdd nvme1n1; do echo "y" | sudo mkfs.ext4 /dev/$disk; done

Now that these have all been formatted we can get their UUIDs

for disk in sda sdb sdc sdd nvme1n1; do sudo blkid -s UUID -o value /dev/$disk; done

and begin editing fstab. Out of personal preference I like to mount each disk under /mnt/ in a folder named after its UUID. Below is a quick example of mounting the disks and adding them to fstab:

for disk in sda sdb sdc sdd nvme1n1; do
  # look up the UUID once, create a matching mount point, mount it, and persist it in fstab
  uuid=$(sudo blkid -s UUID -o value /dev/$disk)
  sudo mkdir -p /mnt/$uuid
  sudo mount /dev/$disk /mnt/$uuid
  echo "UUID=$uuid /mnt/$uuid ext4 defaults 1 2" | sudo tee -a /etc/fstab
done
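
A quick way to sanity check that the mounts and the new fstab entries line up (findmnt ships with util-linux, so it should already be on the node):

❯ findmnt -t ext4 -o TARGET,SOURCE,FSTYPE
❯ sudo findmnt --verify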

Now we can return to the Longhorn UI for this node, add the new disks, and remove the old disk at /var/lib/rancher. When adding disks ensure you add any tags; in my case I will be tagging the SSD with ssd and the HDDs with hdd.

After saving you should see the node update its total available storage after a few seconds.
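
This can also be checked from the CLI; the Longhorn node object reports per-disk capacity in its status (swap worker-01 for whichever node you just reformatted):

❯ kubectl -n longhorn-system get nodes.longhorn.io
❯ kubectl -n longhorn-system get nodes.longhorn.io worker-01 -o jsonpath='{.status.diskStatus}'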

This now needs to be done on the second node.

New StorageClasses

Now it is time to create two new StorageClasses and edit the current default StorageClass.

ConfigMap

The default StorageClass ships as a ConfigMap. The only change here is the addition of diskSelector: "default" to the embedded StorageClass definition.

---
# Source: longhorn/templates/storageclass.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: longhorn-storageclass
  namespace: longhorn-system
  labels:
    app.kubernetes.io/name: longhorn
    app.kubernetes.io/instance: longhorn
    app.kubernetes.io/version: v1.4.1
data:
  storageclass.yaml: |
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: longhorn
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    reclaimPolicy: "Delete"
    volumeBindingMode: Immediate
    parameters:
      diskSelector: "default"
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
      fromBackup: ""
      fsType: "ext4"
      dataLocality: "disabled"

StorageClasses

I will now be adding two more StorageClasses, one for the HDDs and one for the SSDs.

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-slow
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: "Delete"
volumeBindingMode: Immediate
parameters:
  diskSelector: "hdd"
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "disabled"
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-fast
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: "Delete"
volumeBindingMode: Immediate
parameters:
  diskSelector: "ssd"
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
  dataLocality: "disabled"
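
After saving these manifests, apply the updated ConfigMap and the two new StorageClasses and check that they show up; the file names here are arbitrary:

❯ kubectl apply -f longhorn-storageclass-configmap.yaml
❯ kubectl apply -f longhorn-storageclasses.yaml
❯ kubectl get storageclass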

Edit existing volumes

There is currently no way in the Longhorn UI to adjust volume tags, and I am far too lazy to edit all of them by hand.

❯ kubectl get volumes -n longhorn-system             
NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE            NODE        AGE
pvc-0078be9d-1f6b-445a-ad22-acaba5cc95c9   attached   healthy                  10737418240     worker-03   28d
pvc-09014301-c264-4e6c-90c9-dcd54e8d37f3   attached   healthy                  21474836480     worker-02   27d
pvc-093951ee-0600-4fce-8f8a-bb3e75022c8c   attached   healthy                  53687091200     worker-02   27d
pvc-280d7773-111f-4a3b-8b0d-2b0337599575   attached   healthy                  53687091200     worker-01   14d
pvc-3d51540a-53a9-432f-837d-1bb790efcdc5   attached   healthy                  10737418240     worker-03   27d
pvc-3f1df940-6f4e-4a34-9a19-a9b1303b5fe1   attached   healthy                  10737418240     worker-03   28d
pvc-48f2680f-6b32-4d3a-b863-cd78e136d600   attached   healthy                  10737418240     worker-02   27d
pvc-511b50d9-7fc7-48e8-b614-942591506229   attached   healthy                  10737418240     worker-04   27d
pvc-53b3aa84-77e7-4cef-9b24-d93bafd97de8   attached   healthy                  8589934592      worker-01   28d
pvc-566eff31-cd13-4ed6-a80f-def3aa88f539   attached   healthy                  21474836480     worker-04   14d
pvc-62dae2d3-e71a-4135-9fef-62a6e8d264d6   attached   healthy                  10737418240     worker-02   27d
pvc-649cdec0-cc19-4c5e-87d0-a5609cfe723e   attached   healthy                  21474836480     worker-04   28d
pvc-6f2f83e9-45ee-4abe-bb61-dbda0b1730cd   attached   healthy                  1099511627776   worker-01   25d
pvc-71a77ff4-2882-4ae7-979a-111ed6d45ec7   detached   unknown                  5368709120                  28d
pvc-81dc925e-f94e-4446-919e-4511026aa2f2   attached   healthy                  21474836480     worker-04   27d
pvc-83c51034-de8f-432c-b854-af8e1f5c882e   attached   healthy                  53687091200     worker-02   27d
pvc-8888dc52-c01a-4df5-ba0b-7fe82c47f375   attached   healthy                  10737418240     worker-03   27d
pvc-969dc30e-26e5-4c1c-9d5f-3084af4d5601   attached   healthy                  53687091200     worker-02   17d
pvc-989a81af-1405-4cbe-b3dc-c30f21e91aa4   attached   healthy                  5368709120      worker-03   28d
pvc-99d8ed79-7f43-449f-9375-9ec8be03c526   attached   healthy                  53687091200     worker-03   4d3h
pvc-ad3c25d2-5c02-4d9f-9cbd-3ffc77a8a5ab   attached   healthy                  1099511627776   worker-01   18d
pvc-b757ec25-5da5-418b-ae04-9263247a6f18   attached   healthy                  536870912000    worker-01   27d
pvc-bbbafdaa-fd06-48bf-8ad2-930e1746b30b   attached   healthy                  1073741824      worker-02   28d
pvc-c8d97dd3-1b08-407f-99b4-5a1140c1dfba   attached   healthy                  53687091200     worker-01   27d
pvc-cfb0a484-56f8-4006-be2c-cc478a88d588   attached   healthy                  21474836480     worker-03   28d
pvc-d3795c65-23cc-4b73-90ba-38d0510a4312   attached   healthy                  10737418240     worker-03   27d
pvc-d9754a64-427b-4a75-9083-6b0cbb6bea59   attached   healthy                  10737418240     worker-02   27d
pvc-db1c36e0-0ebc-468f-9e9d-aac68fe0b790   attached   healthy                  53687091200     worker-02   24d
pvc-e02fd7e8-be31-4e0c-8f06-d9fc0b2dd67d   attached   healthy                  53687091200     worker-03   28d
pvc-e072d250-b09a-421e-b30b-c1044754609c   attached   healthy                  21474836480     worker-04   27d
pvc-e2db04cf-637d-472b-ad6a-0fe4eb719e9c   attached   healthy                  10737418240     worker-02   28d
pvc-ea85bb6f-f769-4e5e-96b0-7324463332d1   attached   healthy                  21474836480     worker-02   27d
pvc-f1ef7d13-ec2c-4faa-a10e-ecbe081463b8   attached   healthy                  10737418240     worker-03   4d3h
pvc-fcad45cd-38d5-4004-9521-c133ffe14a68   attached   healthy                  10737418240     worker-02   27d
pvc-fe65be2c-664c-4908-b966-382d7bcf4c68   attached   healthy                  10737418240     worker-01   27d

By default I want all volumes to move to the HDDs, as they make up the bulk of the available storage. First, all volumes will need their diskSelector values set to hdd.

for vol in $(kubectl get volumes -n longhorn-system | tail -n +2 | awk '{ print $1}'); do
  kubectl -n longhorn-system patch volumes.longhorn.io $vol --type=merge -p '{"spec":{"diskSelector":["hdd"]}}'
done
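
To confirm the patches took, the selector can be read back from each volume's spec:

❯ kubectl -n longhorn-system get volumes.longhorn.io \
    -o custom-columns=NAME:.metadata.name,DISKSELECTOR:.spec.diskSelector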

This will patch all available volumes. I will now also need to degrade the volumes by reducing their replica counts to one.

for vol in $(kubectl get volumes -n longhorn-system | tail -n +2 | awk '{ print $1}'); do
  kubectl -n longhorn-system patch volumes.longhorn.io $vol --type=merge -p '{"spec":{"numberOfReplicas":1}}'
done

After updating the replica count, go back to the UI, select all volumes, choose Update Replicas Auto Balance from the dropdown, and set it to “Best-Effort”. This will force the surplus replicas to be destroyed.
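
If you would rather stay in the CLI for this step too, the same knob appears to live on the volume spec as replicaAutoBalance, with the lowercase value best-effort (field name and value taken from the Volume CRD; verify against your Longhorn version):

for vol in $(kubectl get volumes -n longhorn-system | tail -n +2 | awk '{ print $1}'); do
  kubectl -n longhorn-system patch volumes.longhorn.io $vol --type=merge -p '{"spec":{"replicaAutoBalance":"best-effort"}}'
done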

Warning: At this stage each volume has only one replica; your data is at risk until the replica count is restored.

Ensure your two reformatted nodes are now schedulable, and from the CLI update the replica count back to 2.

for vol in $(kubectl get volumes -n longhorn-system | tail -n +2 | awk '{ print $1}'); do
  kubectl -n longhorn-system patch volumes.longhorn.io $vol --type=merge -p '{"spec":{"numberOfReplicas":2}}'
done

This forces new replicas to be created on the properly tagged storage. After the data has been rebuilt, we should be back to two healthy replicas for every volume. Now the process can simply be repeated for the last two nodes.
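
Rebuild progress can be watched with the same volume listing as before; the ROBUSTNESS column should move from degraded back to healthy as the second replicas are rebuilt:

❯ kubectl -n longhorn-system get volumes.longhorn.io -w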

Conclusion

The process was fairly straightforward, however there is one loose end I am unsure how to tidy up. The previous PV/PVC manifests still reference the default StorageClass, and this can't be changed without creating a new PV/PVC and copying the data from the old volume to the new one. That is doable, but it would be exceptionally tedious.