If you are like me, you keep reading and hearing about Kubernetes and how it is the future of deploying applications, be it small apps or the next billion-dollar SaaS product. But I never really had the opportunity to gain hands-on experience with actually getting things done in Kubernetes. I know that basically every major cloud provider offers some form of managed Kubernetes, but I never had the feeling that I would understand the topic well by relying on them. That brought me to an idea. The Hetzner cloud has always been dear to my heart, because it offers some of the best bang for your buck among cloud services. On the other hand, one of the frequent complaints people voice about Hetzner is that their cloud is rather barebones compared to other offerings. So I got the idea: what if I roll my own Kubernetes cluster on Hetzner VPS hosts and jump through all the hoops of maintaining and using Kubernetes?
The code for this post is hosted on Codeberg, feel free to take a look!
Table of Contents
- Architecture
- Choosing a Kubernetes Distribution
- Creating the cluster nodes
- Ingress and Cloud Controller Manager
- Cert-Manager and TLS
- Deploying ArgoCD
- Costs of the Setup
- Further Plans
Architecture
For simplicity, I decided to go with a three-node setup for my cluster. Every node is a control plane node, hosting etcd and the Kubernetes API. Since etcd needs a quorum of two out of three members, this gives me high availability in the sense that one node can fail without the cluster losing its ability to operate.
Choosing a Kubernetes Distribution
Similar to how Linux comes in many different flavours you can choose from via distributions, Kubernetes also comes in many different editions, each with its own tradeoffs and benefits. There is vanilla Kubernetes, which requires you to make many decisions yourself. It is generally not recommended for production scenarios, but it can provide great value when used for educational purposes. Although tempted at first, I ultimately decided against vanilla Kubernetes. I then took a look at the two main Kubernetes setups in the enterprise context: Red Hat's OpenShift and SUSE's Rancher. OpenShift requires a Red Hat subscription, but you can use the open-source version OKD instead. Initially, I wanted to use OKD, but the lack of documentation and good examples made me give up rather quickly. That was okay, though, as the heavyweight nature of OKD wasn't to my liking anyway. Rancher is really similar to OKD / OpenShift, so I wasn't keen on trying that either. But while reading about Rancher, I stumbled upon k3s, which markets itself as "a lightweight kubernetes distribution [...]". The documentation was pleasant to read, and while researching, it appeared that k3s enjoys a really good reputation. So I decided to use k3s for my cluster setup.
Creating the cluster nodes
With a distribution selected, it was time to get something deployed to Hetzner. I opted to use OpenTofu to manage my infrastructure in a way that can be stored in Git. But before we can deploy any compute, we need to set up some prerequisites. The nodes need to be provisioned with an SSH key to allow for troubleshooting and for retrieving configuration files from the nodes. So we need to generate a key and upload it to the Hetzner cloud so it can be used by the nodes. That can be achieved by combining the hcloud provider with the tls provider, both available in the OpenTofu registry:
resource "tls_private_key" "master_node_key" {
algorithm = "ED25519"
}
resource "hcloud_ssh_key" "hcloud_master_node_key" {
name = "${var.hcloud_namespace}-master-key"
public_key = tls_private_key.master_node_key.public_key_openssh
}
We generate an ED25519 keypair and then use the hcloud provider to upload it to Hetzner. Pretty straightforward. The next thing I wanted to do was to set up an internal network for cluster communication. This can be achieved similarly with Terraform:
resource "hcloud_network" "cluster_network" {
name = "${var.hcloud_namespace}-cluster-net"
ip_range = var.ip_range
}
resource "hcloud_network_subnet" "cluster_network_subnet" {
type = "cloud"
network_id = hcloud_network.cluster_network.id
network_zone = var.network_zone
ip_range = var.ip_subnet
}
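Both snippets assume that the hcloud and tls providers are declared for the project and that the hcloud provider is authenticated with an API token. A minimal sketch of what that declaration might look like (the version constraints here are my assumption, not taken from the original code):
terraform {
  required_providers {
    # Hetzner Cloud provider for servers, networks and SSH keys
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.50"
    }
    # tls provider for generating the ED25519 keypair locally
    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.0"
    }
  }
}

# The hcloud provider authenticates with an API token
provider "hcloud" {
  token = var.hcloud_token
}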
With that done, all prerequisites are there for deploying some nodes. This needs to be done in two steps: first, a single node is deployed that initializes the cluster:
resource "hcloud_server" "master_node" {
name = "${var.hcloud_namespace}-master-1"
image = "ubuntu-24.04"
server_type = "cax11"
location = var.hcloud_location
ssh_keys = [var.master_key_id]
public_net {
ipv4_enabled = true
ipv6_enabled = false
}
network {
network_id = var.cluster_network_id
ip = "10.0.1.1"
}
user_data = templatefile("${path.module}/../../scripts/cloud-init-master.tpl", {})
}
This creates a VPS of the cax11 type in the selected region and provisions it with Ubuntu and the previously created SSH key. It also puts it into the network we created in the previous step. The cluster initialization happens in the user-data script that gets passed to the server:
#cloud-config
packages:
- curl
users:
- name: cluster
sudo: ALL=(ALL) NOPASSWD:ALL
shell: /bin/bash
runcmd:
- apt update -y
- curl https://get.k3s.io | INSTALL_K3S_EXEC="--cluster-init --disable traefik --disable-cloud-controller --kubelet-arg cloud-provider=external --tls-san 10.0.1.1" sh -
- chown cluster:cluster /etc/rancher/k3s/k3s.yaml
- chown cluster:cluster /var/lib/rancher/k3s/server/node-token
The user-data script creates the `cluster` user with sudo rights and then proceeds to install k3s via the official install script. The flags set are important:
- `--cluster-init` initializes a cluster with etcd and everything needed for nodes to join.
- `--disable traefik` disables the default ingress used by k3s. This is not strictly necessary; I just need to do that as I wish to use NGINX as ingress later on.
- `--disable-cloud-controller` is important so that the Hetzner hcloud cloud controller manager can be used for leveraging existing Hetzner resources.
- `--kubelet-arg cloud-provider=external` specifies that we use an external cloud controller manager.
- `--tls-san 10.0.1.1` is required because, by default, the k3s install script only puts the public IP in the subject of the TLS certificate. This makes sure that communication works via TLS on the private network.
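Since the SSH key from earlier is provisioned onto the node, the kubeconfig that k3s writes to /etc/rancher/k3s/k3s.yaml can be copied down for local kubectl access. Two outputs make this easier; they are a sketch of mine (assuming the key and server resources live in the same module), not part of the original code:
# Public IPv4 of the first master, handy for SSH and as the kubeconfig server address
output "master_node_ipv4" {
  value = hcloud_server.master_node.ipv4_address
}

# Private key matching the uploaded hcloud SSH key; marked sensitive so it is not printed
output "master_node_private_key" {
  value     = tls_private_key.master_node_key.private_key_openssh
  sensitive = true
}
Note that the kubeconfig fetched from the node points at 127.0.0.1 by default, so the server address has to be swapped for the node's IP before using it from your own machine.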
After that, more nodes can be added to the cluster in a very similar way. In Terraform, I deploy two more VPS nodes:
resource "hcloud_server" "additional_master_nodes" {
count = 2
name = "${var.hcloud_namespace}-master-${count.index + 2}"
server_type = "cax11"
image = "ubuntu-24.04"
location = var.hcloud_location
public_net {
ipv4_enabled = true
ipv6_enabled = false
}
network {
network_id = var.cluster_network_id
}
user_data = templatefile("${path.module}/../../scripts/cloud-init-worker.tpl", {
master_private_key = base64encode(var.master_private_key)
master_node_ip = "10.0.1.1"
})
depends_on = [hcloud_server.master_node]
}
The setup is almost the same as for the initial node, except that a different user-data script is used and the private key of the initial master node is passed in as input. Why that is necessary becomes apparent in the user-data script:
#cloud-config
packages:
- curl
users:
- name: cluster
sudo: ALL=(ALL) NOPASSWD:ALL
shell: /bin/bash
write_files:
- path: /root/.ssh/id_ed25519
encoding: b64
content: "${master_private_key}"
permissions: "0600"
runcmd:
- apt update -y
- until curl -k https://${master_node_ip}:6443; do sleep 5; done
- REMOTE_TOKEN=$(ssh -o StrictHostKeyChecking=accept-new root@${master_node_ip} sudo cat /var/lib/rancher/k3s/server/node-token)
- curl -sfL https://get.k3s.io | K3S_TOKEN=$REMOTE_TOKEN INSTALL_K3S_EXEC="server --server https://${master_node_ip}:6443 --token $REMOTE_TOKEN --disable traefik --disable-cloud-controller --kubelet-arg cloud-provider=external" sh -
Similarly, a `cluster` user is created and the k3s installer gets downloaded and executed. But before that, we use SSH to connect to the initial master node and retrieve the token that is required to join the cluster. I do not know whether that is a good solution, but it works for now and I have not come up with a better one yet. Once all resources are deployed, we should be able to see our cluster in action. `kubectl` can be used to verify that:
kubectl get nodes
You should see three nodes that all have the roles `control-plane`, `etcd` and `master`:
NAME STATUS ROLES AGE VERSION
codeberg-k3s-master-1 Ready control-plane,etcd,master 6d4h v1.32.5+k3s1
codeberg-k3s-master-2 Ready control-plane,etcd,master 6d4h v1.32.5+k3s1
codeberg-k3s-master-3 Ready control-plane,etcd,master 6d4h v1.32.5+k3s1
This means our cluster is alive and kicking, and we can now start using Kubernetes to further interact with it!
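From here on, I manage the cluster through the kubernetes and helm Terraform providers, which need access to the kubeconfig retrieved from the master node. A minimal sketch of that wiring, with the local path being my assumption and the block syntax matching version 2 of the helm provider:
# Point both providers at the kubeconfig copied from the master node
provider "kubernetes" {
  config_path = "~/.kube/hetzner-k3s.yaml"
}

provider "helm" {
  kubernetes {
    config_path = "~/.kube/hetzner-k3s.yaml"
  }
}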
Ingress and Cloud Controller Manager
The first things I wanted to deploy into the cluster were NGINX as my ingress controller and the hcloud cloud controller manager (CCM). Because we now have a working Kubernetes cluster, we do not have to fiddle around with a mix of Terraform, Bash and cloud-init: we can now fiddle around with a mix of Terraform, Kubernetes and Helm. For that, I added the providers for Helm and Kubernetes and began writing Terraform to get everything into my cluster. I started with the CCM. Hetzner's hcloud CCM requires that a valid token for your hcloud account is available as a cluster secret. Achieving that is trivial with Terraform and the Kubernetes provider:
resource "kubernetes_secret" "hcloud_token" {
metadata {
name = "hcloud"
namespace = "kube-system"
}
data = {
token = var.hcloud_token
}
type = "Opaque"
}
This reads the token from the Terraform variables and uploads it to the cluster for later use. Once that is done, installing the hcloud CCM is straightforward with Helm:
resource "helm_release" "hcloud_ccm" {
name = "hcloud-cloud-controller-manager"
namespace = "kube-system"
repository = "https://charts.hetzner.cloud"
chart = "hcloud-cloud-controller-manager"
version = "1.25.1"
values = [
yamlencode({
controller = {
enabled = true
}
cloudControllerManager = {
enabled = true
hcloudToken = var.hcloud_token
}
})
]
timeout = 300
atomic = true
recreate_pods = true
force_update = true
}
This simply installs the specified Helm chart with the given inputs. Note that I increased the timeout because, depending on what is currently going on in the cluster, it can take a moment until the chart is up and running. With that, the Hetzner CCM is deployed and ready for use. The NGINX ingress controller is similarly easy: you just apply the corresponding Helm chart:
resource "helm_release" "nginx_ingress" {
name = var.name
namespace = var.namespace
create_namespace = true
repository = "https://kubernetes.github.io/ingress-nginx"
chart = "ingress-nginx"
version = var.chart_version
values = [
yamlencode({
controller = {
replicaCount = 2
publishService = {
enabled = true
}
service = {
type = "LoadBalancer"
externalTrafficPolicy = "Local"
}
}
})
]
timeout = 300
atomic = true
recreate_pods = true
force_update = true
}
It is basically the same procedure as for the CCM in the step before. To work with the NGINX ingress later on, the LoadBalancer service created by the chart can be looked up with a Terraform data source:
data "kubernetes_service" "nginx_ingress" {
metadata {
name = "nginx-ingress-ingress-nginx-controller"
namespace = var.namespace
}
depends_on = [helm_release.nginx_ingress]
}
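One thing this data source is handy for is reading the public IP that ends up on the load balancer in front of the ingress controller, for example to point DNS records at it. A sketch of such an output (the attribute path follows the kubernetes provider's documented schema, but is worth double-checking against your provider version):
# Public address of the LoadBalancer fronting the NGINX ingress controller
output "ingress_load_balancer_ip" {
  value = data.kubernetes_service.nginx_ingress.status[0].load_balancer[0].ingress[0].ip
}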
With that, we are almost ready to start deploying real applications; the only thing that needs to be taken care of beforehand is how the cluster handles certificates and TLS. Luckily, there is a battle-tested solution that can help us.
Cert-Manager and TLS
As is custom with everything Kubernetes, I want the process of obtaining certificates for my services to be as automated as possible. Ideally, I don't want to think about it at all after configuring it: services should be able to simply request a cert and automagically get assigned one. There is a solution that provides exactly that, called `cert-manager`. cert-manager creates certificates as they are requested and maintains them, ensuring they don't expire unexpectedly. For it to work, you need to configure an `Issuer`, which is simply a source of certificates. By default, cert-manager hands out self-signed certs. The interesting bit for me is that cert-manager fully supports the ACME protocol, which means I can use Let's Encrypt to fetch certs for my domain that work out of the box. It supports both http01 and dns01 challenges. I decided to go for dns01, which does not require a working HTTP server to solve the challenge and obtain a cert. Instead, you prove ownership of the domain by manipulating your domain's DNS records. For that, you need to have your domain hosted with a supported provider and put a token for accessing their API into your secret store. As I use Cloudflare, I simply put the token into my cluster, just as I did before for the CCM:
resource "kubernetes_secret" "cloudflare_api_token" {
metadata {
name = "cloudflare-api-token-secret"
namespace = var.namespace
}
type = "Opaque"
data = {
token = var.cloudflare_token
}
}
Deploying cert-manager is just as simple as the other pieces, since cert-manager also provides a Helm chart:
resource "helm_release" "cert_manager" {
name = var.name
namespace = var.namespace
create_namespace = true
repository = "https://charts.jetstack.io"
chart = "cert-manager"
version = var.chart_version
set = [
{
name = "crds.enabled"
value = "true"
}
]
timeout = 300
atomic = true
recreate_pods = true
force_update = true
}
With that, we are almost ready to issue certificates for the cluster. The only thing left is to configure a `ClusterIssuer`. This happens via a plain old manifest:
resource "kubernetes_manifest" "cluster_issuer" {
manifest = {
apiVersion = "cert-manager.io/v1"
kind = "ClusterIssuer"
metadata = {
name = var.issuer_name
}
spec = {
acme = {
server = "https://acme-v02.api.letsencrypt.org/directory"
email = var.certificate_email
privateKeySecretRef = {
name = "${var.issuer_name}-account-key"
}
solvers = [
{
dns01 = {
cloudflare = {
email = var.cloudflare_email
apiTokenSecretRef = {
name = kubernetes_secret.cloudflare_api_token.metadata[0].name
key = "token"
}
}
}
}
]
}
}
}
depends_on = [kubernetes_secret.cloudflare_api_token]
}
The interesting part is the spec, where the ACME server gets configured and the associated email and API token get set. With that, your cluster can provision TLS certificates fully automatically and without human intervention, provided the configuration is correct.
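To make the certificate flow concrete, here is a hypothetical Ingress (the app name and hostname are made up, not something running in my cluster) that requests a certificate from the ClusterIssuer via the cert-manager.io/cluster-issuer annotation:
resource "kubernetes_manifest" "example_app_ingress" {
  manifest = {
    apiVersion = "networking.k8s.io/v1"
    kind       = "Ingress"
    metadata = {
      name      = "example-app"
      namespace = "default"
      annotations = {
        # tells cert-manager to issue a certificate via our ClusterIssuer
        "cert-manager.io/cluster-issuer" = var.issuer_name
      }
    }
    spec = {
      ingressClassName = "nginx"
      tls = [{
        hosts      = ["app.example.com"]
        secretName = "example-app-tls"
      }]
      rules = [{
        host = "app.example.com"
        http = {
          paths = [{
            path     = "/"
            pathType = "Prefix"
            backend = {
              service = {
                name = "example-app"
                port = { number = 80 }
              }
            }
          }]
        }
      }]
    }
  }
}
cert-manager picks up the annotation, solves the dns01 challenge for the host and stores the issued certificate in the referenced secret, which NGINX then serves. With that, the most important things are ready to go, which means I can now finally deploy the application that made me try Kubernetes in the first place.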
Deploying ArgoCD
ArgoCD is the one thing that made me do all of this in the first place. I want to use Argo to implement the GitOps paradigm. When doing GitOps, your whole infrastructure is defined in plain text files that are under version control. The repo is the single source of truth; changes to the infrastructure are equivalent to modifying the files in the repo. ArgoCD is a tool that helps achieve exactly that by monitoring one or multiple repos for changes and making sure that the cluster state reflects the state in the repo. As you are most likely accustomed to by now, Argo gets deployed via a Helm chart too:
resource "helm_release" "argocd" {
name = var.name
namespace = var.namespace
create_namespace = true
repository = "https://argoproj.github.io/argo-helm"
chart = "argo-cd"
version = var.chart_version
set = [
{
name = "global.domain"
value = var.domain
},
{
name = "server.certificate.enabled"
value = "true"
},
{
name = "server.extraArgs"
value = "{--insecure}"
},
{
name = "server.certificate.domain"
value = var.domain
},
{
name = "server.certificate.issuer.kind"
value = "ClusterIssuer"
},
{
name = "server.certificate.issuer.name"
value = var.issuer_name
},
{
name = "server.ingress.enabled"
value = "true"
},
{
name = "server.ingress.tls"
value = "true"
},
{
name = "server.ingress.ingressClassName"
value = "nginx"
},
{
name = "server.ingress.annotations.nginx\\.ingress\\.kubernetes\\.io/ssl-redirect"
value = "true"
},
{
name = "configs.secret.create"
value = "true"
},
{
name = "configs.secret.data.admin\\.passwordMtime"
value = timestamp()
}
]
set_sensitive = [
{
name = "configs.secret.data.admin\\.password"
value = var.argo_cd_password_bcrypt
}
]
timeout = 300
atomic = true
recreate_pods = true
force_update = true
}
This deploys Argo into a dedicated namespace in the cluster, obtains a certificate and configures an ingress, all fully automated. After running this, I have Argo up and running on my cluster, publicly available and with a valid TLS certificate. Woohoo! Although Kubernetes was a hassle at some points along the way, the end result, with so many things just happening automatically, is very cool.
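To sketch what Argo actually works with, here is a hypothetical Application manifest expressed in Terraform (the repository URL, path and namespaces are placeholders, not my actual GitOps repo); it tells Argo to keep whatever is defined under that path of the repo in sync with the cluster:
resource "kubernetes_manifest" "example_application" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "example-app"
      namespace = "argocd"
    }
    spec = {
      project = "default"
      source = {
        repoURL        = "https://codeberg.org/example/gitops-repo.git"
        targetRevision = "HEAD"
        path           = "apps/example-app"
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = "example-app"
      }
      syncPolicy = {
        # automatically remove drift and deleted resources
        automated = {
          prune    = true
          selfHeal = true
        }
      }
    }
  }
}
Once applied, Argo continuously compares the manifests in the repo with the live cluster state and reconciles any drift, which is exactly the GitOps loop described above.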
Costs of the Setup
We are running three nodes of the CAX11 flavour. These are well suited because k3s is optimized for running on ARM-based hosts and they are really cheap. At the time of writing this post, a CAX11 instance costs only 3.92€ per month, so the deployed cluster clocks in at 3 × 3.92€ ≈ 11.76€, just below 12€ a month, which is really good. The cost could increase a bit later down the line if you opt into using other Hetzner resources, e.g. Storage Boxes for your container storage provider, but it should stay manageable.
Further Plans
I am currently satisfied with my setup. In the next steps, I want to improve some general things about my cluster:
- switch to a distro optimized for container workloads, preferably immutable by default
- harden the setup so that security is at a sensible level
- make the deployment a bit simpler (you currently have to run multiple terraform projects in a certain sequence for everything to work)
I also want to leverage the existing ArgoCD as a way to bootstrap my GitOps repo, so that really everything running on the cluster is defined in Git. But that will have to wait until I have time again.