November 14, 2022

MLOps platform on Rancher RKE2 Kubernetes Cluster — Bare Metal environment.

Published by

Author's Bio: Shanker JJ is a Senior InfraOps Engineer, part of the AI Engineering team at AI Inside Inc., Japan. He focuses on building production-ready ML Operations Infrastructure, ML services, tools, and data pipelines.

The Kubeflow installation documents cover environment setup through packaged distributions or public cloud environments. This blog covers the prerequisite environment setup and the Kubeflow 1.6.0 installation on a Rancher RKE2 Kubernetes cluster running on bare-metal servers.

Overview:

This MLOps platform guide covers the deployment procedure for Kubeflow on a Rancher RKE2 Kubernetes cluster deployed in a bare-metal environment. #RKE2 #Kubeflow

Kubernetes deprecated support for Docker as a container runtime (the dockershim) starting with version 1.20. So we decided to use RKE2 as the Kubernetes distribution, since it focuses on security and ships containerd as its container runtime (special mention to the Rancher community for its support).

Note: RKE2 Kubernetes v1.22.15+rke2r1 is supported by the latest Kubeflow release, v1.6.0. The latest RKE2 Kubernetes release is v1.25, but Kubeflow v1.6.0 does not support it.

Prerequisites:

  • Install Ubuntu 20.04 on all 3 nodes (1 server + 2 agents).
  • Make sure the required ports are open. Which ones are needed depends on the CNI you select and on whether the node is a server or an agent; see https://docs.rke2.io/install/requirements/#networking. Here we are going to set up the Kubernetes platform in an air-gapped environment behind a proxy.
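Before going further, it can help to confirm that the ports are actually reachable between nodes. The following is a minimal sketch, assuming bash and the coreutils `timeout` command are available on the node; the function names `check_port` and `report_ports` are hypothetical helpers, not part of RKE2.

```shell
# check_port HOST PORT: succeed only if a TCP connection can be opened.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# report_ports HOST PORT...: print one line per port with its state.
report_ports() (
  host=$1
  shift
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port closed"
    fi
  done
)
```

For example, `report_ports <server ip> 9345 6443` run from an agent node checks the RKE2 supervisor and Kubernetes API ports. A port also shows as closed when nothing is listening yet, so the check is only meaningful once the server service is running.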

Following are the steps we’ll go through:

  • RKE2 Server setup
  • RKE2 Agent setup
  • Storage Class setup
  • Kustomize setup
  • Kubeflow setup

RKE2 Server setup:

Download RKE2 images & manifest source for RKE2 server setup by executing the following commands:

mkdir /home/user/rke2-artifacts && cd /home/user/rke2-artifacts
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2-images.linux-amd64.tar.zst
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2.linux-amd64.tar.gz
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/sha256sum-amd64.txt
curl -sfL https://get.rke2.io --output install.sh 
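Since the artifacts are later copied into an air-gapped environment, it is worth checking them against the downloaded checksum list first. This is a sketch that assumes `sha256sum-amd64.txt` uses the standard `sha256sum` output format and that GNU coreutils 8.25 or later is installed (for `--ignore-missing`); `verify_artifacts` is a hypothetical helper name.

```shell
# verify_artifacts DIR: check every downloaded file in DIR against the
# checksum list; entries for files that were not downloaded are skipped.
verify_artifacts() (
  cd "$1" || exit 1
  sha256sum --check --ignore-missing sha256sum-amd64.txt
)
```

Running `verify_artifacts /home/user/rke2-artifacts` should report OK for each archive and return a non-zero status if any file was corrupted in transit.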

If you plan to set up the Kubernetes environment behind a proxy, create the “/etc/default/rke2-server” file:

>> vim /etc/default/rke2-server
HTTP_PROXY="http://<proxy server ip>:<proxy port>"
HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
CONTAINERD_HTTP_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
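The same six variables are needed on the server and on every agent, so generating the file avoids copy-paste mistakes. A small sketch: `write_proxy_env` is a hypothetical helper, and the bypass list is simply the one shown above.

```shell
# write_proxy_env PROXY_URL OUT_FILE: emit the six proxy variables used above.
write_proxy_env() (
  proxy=$1
  bypass="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
  cat > "$2" <<EOF
HTTP_PROXY="$proxy"
HTTPS_PROXY="$proxy"
NO_PROXY="$bypass"
CONTAINERD_HTTP_PROXY="$proxy"
CONTAINERD_HTTPS_PROXY="$proxy"
CONTAINERD_NO_PROXY="$bypass"
EOF
)
```

For example, `write_proxy_env "http://<proxy server ip>:<proxy port>" /etc/default/rke2-server` on the server, and the same call with `/etc/default/rke2-agent` on each agent.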

Create an RKE2 server config file:

>> mkdir -p /etc/rancher/rke2
>> vim /etc/rancher/rke2/config.yaml
tls-san:
 - Add all server ips
 - Add all expected TLS san url
 - svc
 - cluster.local
cni:
 - <cni of your choice or leave blank for default canal>
advertise-address: <server IP, or load balancer address in case of HA>
write-kubeconfig-mode: 644
node-label:
 - "type=gpu-node"

Install RKE2 server using the following command:

INSTALL_RKE2_VERSION=v1.22.15+rke2r1 \
INSTALL_RKE2_ARTIFACT_PATH=/home/user/rke2-artifacts sh install.sh

Enable and start the rke2-server service:

systemctl enable rke2-server.service
systemctl start rke2-server.service

The kubeconfig is located at “/etc/rancher/rke2/rke2.yaml” and the binaries are in “/var/lib/rancher/rke2/bin”.

Execute the commands below to set the environment variables needed to use kubectl and interact with the RKE2 cluster:

export PATH=/var/lib/rancher/rke2/bin:$PATH
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

RKE2 Agent setup:

Download RKE2 images & manifest source for RKE2 agent setup by executing the following commands:

mkdir /home/user/rke2-artifacts && cd /home/user/rke2-artifacts
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2-images.linux-amd64.tar.zst
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/rke2.linux-amd64.tar.gz
wget https://github.com/rancher/rke2/releases/download/v1.22.15%2Brke2r1/sha256sum-amd64.txt
curl -sfL https://get.rke2.io --output install.sh

Install RKE2 agent using the following command:

export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
export CONTAINERD_ADDRESS=/run/k3s/containerd/containerd.sock
export INSTALL_RKE2_TYPE="agent"
INSTALL_RKE2_VERSION=v1.22.15+rke2r1 \
INSTALL_RKE2_ARTIFACT_PATH=/home/user/rke2-artifacts sh install.sh

If you plan to set up the Kubernetes environment behind a proxy, create the “/etc/default/rke2-agent” file:

>> vim /etc/default/rke2-agent
HTTP_PROXY="http://<proxy server ip>:<proxy port>"
HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"
CONTAINERD_HTTP_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_HTTPS_PROXY="http://<proxy server ip>:<proxy port>"
CONTAINERD_NO_PROXY="localhost,127.0.0.1,10.43.0.0/16,10.42.0.0/16,.svc,.cluster.local"

Create the RKE2 config file with the rke2-server token so the agent can join the cluster:

>> mkdir -p /etc/rancher/rke2/
>> vim /etc/rancher/rke2/config.yaml
token: <copy token from rke2-server node /var/lib/rancher/rke2/server/node-token>
server: https://<rke2-server IP, or load balancer address in case of HA>:9345
node-label:
 - "type=gpu-node"

Enable and start the rke2-agent service:

systemctl enable rke2-agent.service
systemctl start rke2-agent.service
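Once the agents are started, it is useful to confirm from the server that all three nodes have registered and are Ready before installing anything on top. A minimal polling sketch follows; `not_ready_nodes`, `wait_for_nodes`, and the `POLL_SECONDS` variable are hypothetical names, and kubectl is assumed to be on the PATH with KUBECONFIG set as shown earlier.

```shell
# not_ready_nodes: print the names of nodes whose STATUS column does not
# start with "Ready" (e.g. NotReady), one per line.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ {print $1}'
}

# wait_for_nodes EXPECTED [TRIES]: poll until EXPECTED nodes are registered
# and all of them are Ready; give up after TRIES polls (default 30).
wait_for_nodes() (
  expected=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    total=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
    if [ "$total" -eq "$expected" ] && [ -z "$(not_ready_nodes)" ]; then
      echo "all $expected nodes Ready"
      exit 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-10}"
  done
  echo "timed out waiting for nodes" >&2
  exit 1
)
```

For the 1 server + 2 agent layout used here, `wait_for_nodes 3` returns once the whole cluster is up.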

Install Helm on the RKE2 server with the following commands:

curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

Storage class setup:

Execute the command below to set up the “local-path” storage class:

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
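Kubeflow's PVCs do not name a storage class unless you edit them, so they rely on the cluster having a default one. Assuming the manifest above creates a class named `local-path` that is not yet marked as default, it can be annotated as sketched below; `make_default_storage_class` is a hypothetical wrapper around the standard `kubectl patch` call.

```shell
# make_default_storage_class NAME: annotate the class so that PVCs without an
# explicit storageClassName are provisioned by it.
make_default_storage_class() {
  kubectl patch storageclass "$1" \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}
```

After `make_default_storage_class local-path`, the output of `kubectl get storageclass` should show the class marked as "(default)".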

Kustomize setup:

Kubeflow 1.6.0 supports Kustomize 3.2.0, so don’t install the latest version. Execute the command below to set up Kustomize 3.2.0:

curl -Lo kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64 && chmod +x kustomize && sudo mv kustomize /usr/local/bin/
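Because a newer kustomize may already be on the PATH, a quick guard before building the manifests can save a confusing failure later. This sketch only assumes that the string 3.2.0 appears somewhere in the output of `kustomize version`; `kustomize_is_3_2_0` is a hypothetical helper name.

```shell
# kustomize_is_3_2_0: succeed only if the kustomize on PATH reports 3.2.0.
kustomize_is_3_2_0() {
  kustomize version 2>/dev/null | grep -q '3\.2\.0'
}
```

A typical use is `kustomize_is_3_2_0 || echo "unexpected kustomize version"` just before the install step.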

Kubeflow setup:

Most of the installation procedures in the Kubeflow installation documents target cloud providers. For bare metal, or any Kubernetes distribution, installing from the manifests with kustomize works well.

Clone the Kubeflow manifests, or download them from https://github.com/kubeflow/manifests/tree/v1.6-branch

git clone git@github.com:kubeflow/manifests.git

Check out the “v1.6-branch” and generate a password hash using the command below (it requires the passlib and bcrypt Python packages):

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
  • Update the generated hash in “common/dex/base/config-map.yaml”.
  • Add the storage class in the following files if you plan to use one other than the default:
      common/oidc-authservice/base/pvc.yaml
      apps/katib/upstream/components/mysql/pvc.yaml
      apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml
      apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml
  • Modify the size of minio-pvc based on the expected size of the artifacts and the available storage.
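The hash update can be scripted rather than edited by hand. The sketch below assumes the Dex config map stores the password on a line of the form `hash: <bcrypt hash>` and that GNU sed is available; `set_dex_hash` is a hypothetical helper, so check the file afterwards.

```shell
# set_dex_hash FILE HASH: replace the value on every line whose key is "hash:".
# bcrypt hashes only contain [./A-Za-z0-9$], so "|" is a safe sed delimiter.
set_dex_hash() (
  file=$1
  hash=$2
  sed -i "s|^\([[:space:]]*hash:\).*|\1 $hash|" "$file"
)
```

Pass the hash in single quotes so the shell does not expand the `$` characters, for example `set_dex_hash common/dex/base/config-map.yaml '<generated hash>'`.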

Execute the following command to install Kubeflow:

while ! kustomize build example | sudo kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

The first passes of this command often fail, because some resources depend on custom resource definitions that are not registered yet, and the terminal prints “Retrying to apply resources”. This is expected: the loop keeps retrying until everything has been applied. Wait until all the pods have a Running status before proceeding.

Check the status of the pods by executing the following commands:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
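Scanning seven namespaces by eye gets tedious, so the check above can be condensed into a helper that lists only the pods that still need attention. This is a sketch; `pending_pods` and `KUBEFLOW_NAMESPACES` are hypothetical names, and the namespace list is the one used in the commands above.

```shell
KUBEFLOW_NAMESPACES="cert-manager istio-system auth knative-eventing knative-serving kubeflow kubeflow-user-example-com"

# pending_pods: print "namespace/pod status" for every pod that is neither
# Running nor Completed in the namespaces listed above.
pending_pods() (
  for ns in $KUBEFLOW_NAMESPACES; do
    kubectl get pods -n "$ns" --no-headers 2>/dev/null |
      awk -v ns="$ns" '$3 != "Running" && $3 != "Completed" {print ns "/" $1, $3}'
  done
)
```

An empty result from `pending_pods` means every pod has reached Running or Completed. It does not check the READY column, so a pod with failing readiness probes still needs a manual look.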

Patch the Istio ingress gateway service to type NodePort:

kubectl patch svc istio-ingressgateway -n istio-system -p '{"spec": {"type": "NodePort"}}'

Note: If a load balancer is available in the environment, set the service type to LoadBalancer instead of NodePort.
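With a NodePort service, Kubernetes assigns the port, so it has to be looked up before opening the dashboard. The sketch below assumes the gateway service exposes its HTTP port under the name `http2`, as in the Istio manifests shipped with Kubeflow; `ingress_http_port` and `kubeflow_url` are hypothetical helper names.

```shell
# ingress_http_port: print the NodePort mapped to the gateway's "http2"
# port (port 80 of the istio-ingressgateway service).
ingress_http_port() {
  kubectl get svc istio-ingressgateway -n istio-system \
    -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}'
}

# kubeflow_url NODE_IP: build the dashboard URL for a given node address.
kubeflow_url() {
  echo "http://$1:$(ingress_http_port)"
}
```

Calling `kubeflow_url <node ip>` with the address of any cluster node prints the URL to open in a browser.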

Now we can access Kubeflow with the default user “user@example.com” and the default password “12341234” (or the password you hashed earlier, if you updated the Dex config).

Troubleshooting:

As mentioned earlier, the RKE2 cluster is built on top of containerd, so we can use the crictl command for troubleshooting if necessary. To use crictl, create the following configuration on the node:

vim /etc/crictl.yaml
runtime-endpoint: unix:///run/k3s/containerd/containerd.sock
image-endpoint: unix:///run/k3s/containerd/containerd.sock
timeout: 10
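With that configuration in place, crictl can inspect containers directly on a node even when the Kubernetes API is unreachable. A small sketch of a log-gathering helper follows; `container_logs` is a hypothetical name, crictl is assumed to be on the PATH (RKE2 places it in /var/lib/rancher/rke2/bin), and it usually needs to run as root.

```shell
# container_logs NAME [LINES]: print the last LINES log lines (default 50)
# of every container, running or exited, whose name matches NAME.
container_logs() (
  name=$1
  lines=${2:-50}
  for id in $(crictl ps -a --name "$name" -q); do
    echo "== container $id =="
    crictl logs --tail "$lines" "$id"
  done
)
```

For example, `container_logs minio 100` shows recent output from the MinIO containers scheduled on the current node.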

Feel free to reach out to me on LinkedIn if you have any questions.