Summary
If you have read the previous two posts in this series you will know the migration to Harvester has had its speed bumps. Most of the issues came down to using too many experimental features and deploying the cluster in an unsupported fashion. At this point it seems best to completely redo not just the physical nodes but also to make some tweaks to the VMs. The goals for this redeployment are the following:
- Redeploy Rancher MCM to a separate physical cluster
- Redeploy Harvester to remaining nodes
- Build custom VM templates (without XFS)
Redeploying Rancher Cluster
Device Setup
The Rancher cluster will follow the typical deployment model for Rancher MCM. Three NUCs were set up with Rocky 9.3, and now I have the following L1/2 setup:
Rancher Cluster Setup
After all three nodes were reimaged, RKE2 v1.27.15+rke2r1 was installed via Ansible using the rancherfederal/rke2-ansible playbook. As always, I default to the Cilium CNI with kube-proxy replacement enabled. It is fairly easy to pass config values to RKE2 via the Ansible playbook; all examples below fall under the rke2_config key:
Defining S3 etcd Backups
While I do believe using an S3 backup of etcd is overkill in my case (I mean, what's the chance all three nodes become unrecoverable?), I still see value in doing so: Backblaze is free if you use less than 10 GB, and I may also end up using S3 for CSI backups later.
rke2_config:
  etcd-snapshot-retention: 10
  etcd-snapshot-schedule-cron: '0 */4 * * *'
  etcd-s3: true
  etcd-s3-endpoint: ""
  etcd-s3-access-key: ""
  etcd-s3-secret-key: ""
  etcd-s3-bucket: ""
  etcd-s3-region: ""
  etcd-s3-folder: ""
Basic Security
As I am running Rocky and do not disable SELinux, the flag below is needed; I also set the kubeconfig mode to 600. These should probably be the defaults but hey, at least it's an easy fix. I did not enable any CIS profile.
rke2_config:
  selinux: true
  write-kubeconfig-mode: 600
Set TLS SAN
I opt to sign the TLS certificate with every possible SAN I may end up using, so this includes the hostname and IP of each node, plus the kube-vip hostname and IP.
rke2_config:
  tls-san:
    - rancher-01.infra.lan
    - 10.0.0.10
    - rancher-02.infra.lan
    - 10.0.0.11
    - rancher-03.infra.lan
    - 10.0.0.12
    # Rancher cluster kube-vip
    - rke2.lab.lan
    - 10.0.0.29
Defining Cilium
In order for this to work there will be a period where the playbook breaks. With the settings below the RKE2 server will never show Ready, because the default Cilium Helm chart needs a few settings passed to it. Run the playbook with the values below, and note that Ansible will fail:
rke2_config:
  cni:
    - cilium
  disable-kube-proxy: true
Once Ansible has failed you will need to manually drop a file onto the first server: a HelmChartConfig whose values resolve the issue of the server node not reaching the Ready state. Place the following file in /var/lib/rancher/rke2/server/manifests/:
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: strict
    k8sServiceHost: 127.0.0.1
    k8sServicePort: 6443
    cni:
      chainingMode: "none"
Now restart the rke2-server service and it should come online. When it does, rerun the Ansible playbook to finish installing RKE2 on the rest of the nodes. The Ansible repository in question has a rather large update coming that should remove the need to manually place this file on the host.
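For reference, the manual dance on the first server looks roughly like this; the playbook entry point and inventory path are assumptions based on how the rancherfederal playbook is typically laid out, so adjust them to your own repo:

systemctl restart rke2-server.service
# wait for the first server to report Ready
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
# then, from the workstation, rerun the playbook for the remaining nodes
ansible-playbook site.yml -i inventory/rancher/hosts.yml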
All three nodes are master nodes with the NoSchedule taint removed, and I configured the cluster to run the following:
- kube-vip (for k8s API HA)
- Metal-LB (for ingress HA; a rough example follows this list)
- Rancher MCM
- Traefik (for ingress, this is just habit at this point)
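To illustrate the Metal-LB side, the Layer 2 setup amounts to an address pool plus an advertisement; the pool range below is a hypothetical placeholder, not the range I actually carved out:

---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ingress-pool
  namespace: metallb-system
spec:
  addresses:
    # hypothetical range reserved for LoadBalancer services
    - 10.0.0.30-10.0.0.39
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ingress-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool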
Note: In the future I do intend to look into more of Cilium’s features so I can cut out the need for an ingress controller like Traefik or NGINX, and for a load balancer like Metal-LB. Cilium has a lot of features I have not yet been able to dive into, but the idea of having my CNI take on the role of the LB and ingress is very exciting.
Redeploying Harvester Cluster
Device Setup
All Harvester nodes were reimaged, this time with Harvester 1.3.1, which includes fixes for some of the issues I hit on the previous version. This time around I also opted to forgo using Harvester configuration manifests. The manifest configs for Harvester are a good idea, but I no longer had anywhere to pull the configs from and I opted not to upload them to GitHub. Currently the configs don’t save you that much time, especially in such a small environment.
Now I have the following L1/2 setup:
It should be noted that Harvester recommends the management interface run on a 10 Gig link, partly because this is the interface Longhorn traffic uses. My “servers” only have a single 10 Gig interface, so that is what my management interface uses; the drawback is that all traffic flowing into a guest cluster goes over the 1 Gig links. Had I ignored the recommendation and put the management interface on the 1 Gig links, replica rebuild times would have increased significantly.
Harvester is also responsible for promoting and demoting nodes in the cluster to and from the master role, so this time I did not pin any node to a specific role like I did last time.
Harvester Cluster Addons
Unlike the previous setup, only a single addon has been enabled: rancher-monitoring. I made some tweaks to the values, increasing the retention size along with the memory limits. This is highly specific to your own setup, but the default values do seem to be very low and in my previous deployment I saw pods OOM-killed. Below are the values I used (a sketch of how they map onto the chart values follows the tables):
Prometheus:
Key | Value |
---|---|
Retention | 30d |
Retention Size | 50GiB |
Requested CPU | 750m |
Requested Memory | 1750Mi |
CPU Limit | 2000m |
Memory Limit | 5000Mi |
Prometheus Node Exporter:
Key | Value |
---|---|
Requested CPU | 100m |
Requested Memory | 30Mi |
CPU Limit | 200m |
Memory Limit | 180Mi |
Grafana:
Key | Value |
---|---|
Requested CPU | 100m |
Requested Memory | 200Mi |
CPU Limit | 400m |
Memory Limit | 1000Mi |
Alertmanager:
Key | Value |
---|---|
Retention | 120h |
Requested CPU | 100m |
Requested Memory | 100Mi |
CPU Limit | 1000m |
Memory Limit | 600Mi |
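For context, here is roughly how the Prometheus values above land in the underlying rancher-monitoring (kube-prometheus-stack) chart if you edit the addon values directly instead of using the UI form; treat this as a structural sketch rather than a copy of my exact addon YAML:

prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GiB
    resources:
      requests:
        cpu: 750m
        memory: 1750Mi
      limits:
        cpu: 2000m
        memory: 5000Mi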
Harvester Terraform
Harvester has a Terraform provider! I decided to use Harvester’s Terraform provider where possible; it does not yet have 1:1 parity with Harvester’s features, but some IaC is better than none (maybe). Most of this Terraform should be applied before deploying a guest cluster and can be done before importing the Harvester cluster into Rancher. The Harvester provider requires the Harvester kubeconfig file, so before proceeding you will need to download it from the “Support” page in the bottom left of the UI. When setting up the provider, you just need to pass the kubeconfig file location and context.
provider "harvester" {
kubeconfig = "~/.kube/harvester.yaml"
kubecontext = "local"
}
Note: If, like me, you are using a DNS name rather than the IP to access the Harvester UI, you will need to edit the server key in the kubeconfig. When you download the kubeconfig, the server key will contain the hostname you downloaded it from rather than the IP; however, the certificate was generated without that hostname in its TLS SANs, so you will get errors. So again: swap the hostname for the IP in the kubeconfig file.
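In other words, the edit is just this one field; the hostname and VIP below are hypothetical stand-ins for your own values:

clusters:
- cluster:
    # server: https://harvester.lab.lan:6443   <- as downloaded; fails TLS verification
    server: https://10.0.0.20:6443
    certificate-authority-data: <unchanged>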
SSH Keys
Adding an SSH key allows you to select your SSH key from a dropdown and pass it into a VM when creating said VM.
resource "harvester_ssh_key" "jhanafin-key" {
name = "jhanafin-key"
namespace = "harvester-public"
public_key = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDzHX5L4uTO37kSJb5u0pUpgFwXbHJJzKA/mxhMzA6ZL jhanafin@workstation.main.lan"
}
Storage Classes
I have two tiers of storage in my cluster, HDD and SSD. As you can imagine, SSD space is more limited than HDD, so in general I only deploy database workloads to SSDs unless there is a specific reason for a workload to live on SSD. These storage classes can be referenced by guest clusters, allowing you to pass them through. If you also have tiered storage, make sure you label your disks; I chose “hdd”, “ssd”, and “nvme”, then used the Terraform code below to create two storage classes that refer to these tags.
resource "harvester_storageclass" "longhorn-fast" {
name = "longhorn-fast"
allow_volume_expansion = true
is_default = false
reclaim_policy = "Delete"
volume_binding_mode = "Immediate"
volume_provisioner = "driver.longhorn.io"
parameters = {
"migratable" = "true"
"numberOfReplicas" = "3"
"staleReplicaTimeout" = "30"
"diskSelector" = "ssd,nvme"
}
}
resource "harvester_storageclass" "longhorn-slow" {
name = "longhorn-slow"
allow_volume_expansion = true
is_default = true
reclaim_policy = "Delete"
volume_binding_mode = "Immediate"
volume_provisioner = "driver.longhorn.io"
parameters = {
"migratable" = "true"
"numberOfReplicas" = "3"
"staleReplicaTimeout" = "30"
"diskSelector" = "hdd"
}
}
You will be stuck with the two default storage classes, longhorn and harvester-longhorn. This isn’t a huge deal, but note that if, like me, you have multiple tiers of storage, the default classes will match either HDDs or SSDs, meaning replicas will be built on either type of disk.
Networking
A network in Harvester is made up of a few components: the cluster network, the VLAN config, and the network itself; each one depends on the last. As you can see from the Terraform snippets below, all four nodes share a physical interface named “enp3s0”, so they all use that interface. You can select all nodes (the default), only specific nodes, or even match nodes based on labels. My setup below is unremarkable overall.
resource "harvester_clusternetwork" "cluster-net" {
name = "cluster-net"
}
resource "harvester_vlanconfig" "cluster-vlans" {
name = "cluster-vlan"
cluster_network_name = harvester_clusternetwork.cluster-net.name
depends_on = [
resource.harvester_clusternetwork.cluster-net
]
uplink {
nics = [
"enp3s0"
]
bond_mode = "active-backup"
bond_miimon = -1
mtu = 1500
}
}
The following two VLANs both have DHCP servers, so hypothetically I should not need to provide a route_cidr or route_gateway; in my experience, however, the DHCP mode did not appear to work. I am not well versed in what this is truly needed for, but when the settings are correct the “Route Connectivity” column in Harvester will accurately show “Active”. It appears to be used only for determining whether a VLAN is set up properly, and that’s about it.
resource "harvester_network" "iot-vlan" {
name = "iot"
namespace = "harvester-public"
cluster_network_name = harvester_clusternetwork.cluster-net.name
depends_on = [
resource.harvester_vlanconfig.cluster-vlans,
]
vlan_id = 5
route_mode = "manual"
route_cidr = "10.0.5.1/25"
route_gateway = "10.0.5.1"
}
resource "harvester_network" "vm-vlan" {
name = "vm"
namespace = "harvester-public"
cluster_network_name = harvester_clusternetwork.cluster-net.name
depends_on = [
resource.harvester_vlanconfig.cluster-vlans,
]
vlan_id = 7
route_mode = "manual"
route_cidr = "10.0.7.1/26"
route_gateway = "10.0.7.1"
}
Harvester VM Images
If you know me personally you will know my hatred for XFS knows no bounds. I understand it has many features and is more performant than ext4, but to me XFS is about as useful as RAID5 on BTRFS. Don’t get me wrong, k8s isn’t a fan of hard power cuts either, but at least it can handle them. As a result I made a very simple Packer build to avoid XFS (a rough sketch follows the package list below). Thankfully KubeVirt is just QEMU, so Packer can create QEMU VMs which can afterwards be exported and uploaded to Harvester. If you build your own VM images it is crucial to install a few packages:
- qemu-guest-agent
- gdisk
- cloud-utils-growpart
- cloud-init
- tar
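For a sense of what that build looks like, here is a minimal sketch of a Packer QEMU source; the ISO URL, checksum, credentials, and kickstart file name are placeholders of my own, not the exact values from my build:

source "qemu" "rocky9" {
  iso_url          = "https://download.rockylinux.org/pub/rocky/9/isos/x86_64/Rocky-9-latest-x86_64-minimal.iso"
  iso_checksum     = "none" # replace with the real checksum
  accelerator      = "kvm"
  format           = "qcow2"
  disk_size        = "20G"
  memory           = 4096
  cpus             = 2
  http_directory   = "http" # serves the kickstart file to the installer
  boot_wait        = "5s"
  boot_command     = ["<up><tab> inst.text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ks.cfg<enter>"]
  ssh_username     = "root"
  ssh_password     = "packer" # must match the kickstart rootpw
  ssh_timeout      = "30m"
  shutdown_command = "shutdown -P now"
  output_directory = "output-rocky9"
}

build {
  sources = ["source.qemu.rocky9"]
}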
The kickstart snippet below is needed for cloud-init and qemu-guest-agent; both services should be enabled at this point but not started. Packer needs to reboot and SSH into the VM, and it uses root by default for SSH, hence the final line in the snippet.
%post --log=/root/post.log
dnf install epel-release -y
dnf update -y
dnf install qemu-guest-agent gdisk cloud-utils-growpart cloud-init tar -y
systemctl enable cloud-init
systemctl enable --now qemu-guest-agent
sed -i '/#PermitRootLogin*/c\PermitRootLogin yes' /etc/ssh/sshd_config
%end
Make sure to enable sshd as well:
# Services
services --enabled=sshd
After the VM is created you can import it into Harvester as an image, and it can then be used with cloud-init.
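The image itself can also live in Terraform via the provider’s harvester_image resource; this sketch assumes you host the exported qcow2 somewhere the cluster can reach (the URL is hypothetical):

resource "harvester_image" "rocky9-custom" {
  name         = "rocky9-custom"
  namespace    = "harvester-public"
  display_name = "rocky9-custom"
  source_type  = "download"
  # hypothetical location of the exported Packer artifact
  url          = "http://fileserver.lab.lan/images/rocky9-custom.qcow2"
}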
Deploying a Guest Cluster
The Rancher and Harvester clusters are now both set up and ready to go. At this point Rancher MCM (and its RKE2 cluster) is up and has very little configured other than the login. Harvester is also up, and is now set up with:
- The rancher-monitoring addon
- SSH keys
- Custom storage classes
- Networking
- Custom VM images
Import Harvester Cluster
Importing a Harvester cluster is very easy but does require an extra step that is easy to miss. To start, log in to Rancher:
- Select “Virtualization Management” (in the far left column, near the bottom)
- Select “Harvester Clusters”
- Select “Import Existing”
- Give the cluster a name
- Submit
- Copy the URL provided
Now log in to Harvester:
- Go to the Advanced / Settings page of the target Harvester’s UI
- Find the “cluster-registration-url” setting and click the -> Edit Setting button
- Paste the registration URL copied from Rancher and click the Save button
Wait a short while as Harvester and Rancher get set up. Once the Harvester cluster is imported you will need to create a cloud credential. From Rancher:
- Select “Cluster Management”
- Select “Cloud Credentials”
- Select “Create”
- Select “Harvester”
- Give the credentials a name
- Select your imported cluster from the dropdown.
At this point Harvester is imported into Rancher and we now have a cloud credential to use when deploying guest clusters; you can test this by deploying one via the Rancher UI.
Deploying Guest Clusters with Terraform “gotchas”
In this final section I am going to cover the “gotchas” of deploying a guest cluster via Terraform. The Rancher Terraform provider lacks one very important feature for deploying guest clusters, so we will need more than just the Rancher provider: it cannot create the secrets in the Harvester cluster that the Harvester CPI on the guest cluster requires. Before moving on: I created a project when I originally started this, so I will be placing my guest clusters into projects based on their purpose. This is mostly an organizational preference and is not strictly needed, but it does impact the Terraform code later. If you made the project via Terraform you can collect the ID and namespace that way; the UI also provides access to this info. From the Rancher UI:
- Select “Virtualization Management”
- Select your Harvester cluster
- Select “Projects/Namespaces”
- Find your project and click the three dots on the far right
- Select “Edit YAML”
- Copy your name and namespace
- name is under metadata.name
- namespace is under metadata.namespace
Note: Rancher does not use metadata.name as the name shown in the UI, which is why these names look like gibberish; the name you see in the UI is spec.displayName.
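If you prefer the CLI, the same information can be read from the Project objects in the Rancher management (local) cluster; this assumes your current kubeconfig context points at that cluster:

# list projects with their cluster namespace, internal name, and display name
kubectl get projects.management.cattle.io -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DISPLAY:.spec.displayName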
Before deploying the guest cluster we need to create the secret the downstream cluster uses to access Harvester features like the load balancer and storage. To create the secret you will need to download the Harvester kubeconfig and use it in the Kubernetes provider config, like so:
provider "kubernetes" {
  config_path    = data.sops_file.kubernetes.data["config_path"]
  config_context = data.sops_file.kubernetes.data["config_context"]
}
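The sops_file data source above comes from the carlpett/sops provider; a minimal sketch, assuming the kubeconfig path and context are stored in an encrypted secrets.yaml (the file name is my own placeholder):

data "sops_file" "kubernetes" {
  # SOPS-encrypted file containing the config_path and config_context keys
  source_file = "secrets.yaml"
}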
Now we can create the secret for the guest cluster, which also involves creating the namespace:
resource "kubernetes_namespace_v1" "namespace" {
metadata {
name = var.namespace
labels = {
"field.cattle.io/projectId" = var.project_id
}
annotations = {
"field.cattle.io/projectId" = "${var.project_namespace}:${var.project_id}"
}
}
lifecycle {
ignore_changes = [
metadata[0].annotations
]
}
}
resource "kubernetes_service_account_v1" "k8s-sa" {
depends_on = [kubernetes_namespace_v1.namespace]
metadata {
name = var.cluster_name
namespace = var.namespace
}
}
resource "kubernetes_cluster_role_binding_v1" "k8s-sa-crb" {
depends_on = [kubernetes_service_account_v1.k8s-sa]
metadata {
name = "${kubernetes_service_account_v1.k8s-sa.metadata.0.namespace}-${kubernetes_service_account_v1.k8s-sa.metadata.0.name}"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "harvesterhci.io:csi-driver"
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account_v1.k8s-sa.metadata.0.name
namespace = kubernetes_service_account_v1.k8s-sa.metadata.0.namespace
}
}
resource "kubernetes_role_binding_v1" "k8s-sa-rb" {
depends_on = [kubernetes_service_account_v1.k8s-sa]
metadata {
name = kubernetes_service_account_v1.k8s-sa.metadata.0.name
namespace = kubernetes_service_account_v1.k8s-sa.metadata.0.namespace
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "harvesterhci.io:cloudprovider"
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account_v1.k8s-sa.metadata.0.name
namespace = kubernetes_service_account_v1.k8s-sa.metadata.0.namespace
}
}
resource "kubernetes_secret_v1" "k8s-secret" {
depends_on = [kubernetes_cluster_role_binding_v1.k8s-sa-crb]
type = "kubernetes.io/service-account-token"
wait_for_service_account_token = true
metadata {
name = var.cluster_name
namespace = kubernetes_service_account_v1.k8s-sa.metadata.0.namespace
annotations = {
"kubernetes.io/service-account.name" = kubernetes_service_account_v1.k8s-sa.metadata.0.name
}
}
This will create the secret; now we need to create the “machine_selector” config:
resource "local_file" "machine_selector" {
depends_on = [ kubernetes_secret_v1.k8s-secret ]
filename = "${path.module}/kubeconfig.yaml"
content = <<-EOT
cloud-provider-name: "harvester"
cloud-provider-config: |-
apiVersion: v1
kind: Config
clusters:
- name: default
cluster:
server: ${var.harvester_url}
certificate-authority-data: ${base64encode(kubernetes_secret_v1.k8s-secret.data["ca.crt"])}
contexts:
- name: default
context:
cluster: default
namespace: ${kubernetes_service_account_v1.k8s-sa.metadata.0.namespace}
user: default
current-context: default
users:
- name: default
user:
token: ${kubernetes_secret_v1.k8s-secret.data["token"]}
EOT
}
data "local_file" "machine_selector" {
depends_on = [ local_file.machine_selector ]
filename = "${path.module}/kubeconfig.yaml"
}
There is probably a better way to do this than creating the file locally, but it works. This file then needs to be converted into a string and added to the machine_selector_config key in the Rancher provider’s rancher2_cluster_v2 resource; the path where the file will be dropped on the node also needs to be provided. The snippet below is heavily abbreviated:
resource "rancher2_cluster_v2" "cluster" {
depends_on = [ data.local_file.machine_selector ]
machine_selector_config {
config = tostring(data.local_file.machine_selector.content)
}
chart_values = <<-EOT
harvester-cloud-provider:
cloudConfigPath: /var/lib/rancher/rke2/etc/config-files/cloud-provider-config
global:
cattle:
clusterName: "${var.cluster_name}"
EOT
machine_global_config = var.machine_global_config
}
}
Conclusion
After spending more time with Harvester and actually deploying it in a supported model, I have grown to like it a lot. There is added complexity and overhead, no doubt; when I originally assembled the hardware for my cluster I went with CPUs and RAM that made sense for a single bare-metal k8s cluster with a good bit of headroom. This means my production cluster of 3 masters (4 CPU, 16 GB RAM) and 3 workers (4 CPU, 64 GB RAM) consumes roughly half of all the RAM available to me:
Someday I may move back to a bare-metal RKE2 cluster; the simplicity of a small cluster really is very hard to beat (especially for a homelab).