Build a machine learning platform Kubeflow from scratch

1 Introduction to Kubeflow

1.1 What is Kubeflow

An introduction from the official website: The Kubeflow project is committed to making the deployment of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal of Kubeflow is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open source systems for ML into different infrastructures. Developers should be able to run Kubeflow anywhere Kubernetes is running.

From the introduction on the official website, we can see that Kubeflow and Kubernetes are inseparable. In general, Kubeflow is a Kubernetes-based ML workflow platform open sourced by Google, which integrates a large number of machine learning tools, such as the jupyterlab environment for interactive experiments, katib for hyperparameter adjustment, and pipeline workflow Controlled argo workflow, etc. As a "large toolbox" collection, kubeflow provides a large number of optional tools for machine learning developers, and also provides feasible tools for the implementation of machine learning projects.

1.2 Kubeflow background

Kubernetes was originally a container platform used to manage stateless applications, but in the past two years, more and more companies have used it to run various workloads, especially machine learning alchemy. Various AI companies or AI departments of Internet companies will try to run TensorFlow, Caffe, MXNet and other distributed learning tasks on Kubernetes, which brings new challenges to Kubernetes.

First of all, distributed machine learning tasks generally involve two different types of work: parameter servers (hereinafter referred to as PS) and working nodes (hereinafter referred to as workers). Moreover, learning tasks in different fields have different requirements for PS and workers, which is reflected in the difficulty of configuration in Kubernetes. Taking TensorFlow as an example, TensorFlow's distributed learning tasks usually start multiple PSs and multiple workers, and in the best practice provided by TensorFlow, each worker and PS requires different command line parameters to be passed in.

Second, the default scheduler of Kubernetes is not friendly to the scheduling of machine learning tasks. If the previous problem is only troublesome in the application and deployment phase, then the problem of low resource utilization caused by scheduling or reduced efficiency of machine learning tasks deserves special attention. Machine learning tasks have relatively high computing and network requirements. Generally speaking, all workers will use GPU for training, and in order to get a better network support, PS and workers of the same machine learning task should be placed on the On the same machine or on adjacent machines with better networks will reduce the time required for training.

In response to these problems, the Kubeflow project came into being. It uses TensorFlow as the first supported framework and defines a new resource type on Kubernetes: TFJob, which is the abbreviation of TensorFlow Job. With such a resource type, engineers who use TensorFlow for machine learning training no longer need to write complicated configurations. They only need to determine the number of PSs and workers and the input and output of data and logs according to their understanding of the business. A training mission.

In one sentence: Kubeflow is a composable, portable, and scalable machine learning technology stack built for Kubernetes.

The above is from the article kubeflow-Introduction https://www.jianshu.com/p/192f22a0b857, this introduction explains the past and present of kubeflow very well, and has a deeper understanding of kubeflow. Simply needed.

1.3 Kubeflow and machine learning

Kubeflow is a platform for data scientists who want to build and conduct ML tasks. Kubeflow is also suitable for ML engineers and operations teams who want to deploy ML systems to various environments for development, testing, and production-level services.

Kubeflow is an ML toolkit for Kubernetes.

The figure below shows Kubeflow as a platform for building machine learning system components based on Kubernetes:

Kubeflow is a glue project that combines many machine learning supports, such as model training, hyperparameter training, model deployment, etc., into containers Deploy in a standardized way, provide high availability and convenient expansion of each system in the entire process, and users who deploy kubeflow can use it to perform different machine learning tasks.

The diagram below shows the machine learning workflow in sequence. The arrows at the end of the workflow pointing to the process indicate that the machine learning task is a gradual iterative process:

In the experimental phase, you develop the model based on the initial hypothesis, and iteratively test and update the model to produce the results you are looking for:

  • Identify the problem you want the ML system to solve;
  • Collect and analyze the data needed to train the ML model;
  • Select an ML framework and algorithm, and code an initial version of the model;
  • Experiment with data and train your model.
  • Tune model hyperparameters to ensure the most efficient processing and the most accurate results.

During the production phase, you deploy a system that performs the following processes:

  • Convert the data into the format required by the training system;
  • To ensure that the model performs consistently during training and prediction, the conversion process must be the same during the experimental and production phases.
  • Train ML models.
  • Serves models for online prediction or running in batch mode.
  • Monitor the performance of the model and feed the results into your process to tune or retrain the model.

The Kubeflow components in the ML workflow are shown in the figure below

1.4 Core components

The core components that constitute Kubeflow, the official website here https://www.kubeflow.org/docs/components/ has a detailed introduction, the following is a mind map I drew:

2 Kubeflow installation guide

2.1 Frequently used links

  • Official customized installation guide repository: https://github.com/kubeflow/manifests
  • Kubeflow official warehouse: https://github.com/kubeflow/
  • kubernetes official website: https://kubernetes.io/zh-cn/
  • github proxy acceleration: https://ghproxy.com/

2.2 Installation environment

Installation Environment:

  • system version
cat /etc/redhat-release

CentOS Linux release 7.9.2009 (Core)
  • running memory
free -h
              total        used        free      shared  buff/cache   available
Mem:           110G        3.4G        105G        3.8M        891M        105G
Swap:          4.0G          0B        4.0G
  • cpu
cat /proc/cpuinfo | grep name | sort | uniq
model name	: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
42

  • gpu
nvidia-smi
Sat Dec 24 13:01:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   38C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2.3 Pre-environment

The pre-environment required to install kubeflow mainly includes the following tools:

  • Kubernetes: up to 1.21
  • kustomize :3.2.0
  • kubectl

https://github.com/kubeflow/manifests#prerequisites

3 Kubernetes installation

The k8s cluster consists of Master nodes and Node (Worker) nodes. Here we only use one machine to install kubernetes.

3.1 view ip

(base) [root@server-szry1agd ~]# ip add

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:44:6c:3c brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.130/22 brd 192.168.3.255 scope global noprefixroute dynamic eth0
       valid_lft 80254sec preferred_lft 80254sec
    inet6 fe80::f816:3eff:fe44:6c3c/64 scope link 
       valid_lft forever preferred_lft forever

3.2 Modify the host name

This step is not necessary. I saw that some articles mentioned that the host name cannot have underscores.

(base) [root@server-szry1agd ~]# hostnamectl set-hostname kubuflow && bash

Comparison before and after modification
insert image description here

3.3 add host

Here you need to change your own ip and host name

(base) [root@kubuflow ~]# cat >> /etc/hosts << EOF 
> 192.168.3.130  kubuflow 
> EOF

View hosts

(base) [root@kubuflow ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
0.0.0.0     server-szry1agd.novalocal
192.168.3.130  kubuflow 

3.4 Turn off the firewall, turn off selinux

(base) [root@kubuflow ~]# systemctl stop firewalld
(base) [root@kubuflow ~]# systemctl disable firewalld
(base) [root@kubuflow ~]# sed -i 's/enforcing/disabled/' /etc/selinux/config # 永久
(base) [root@kubuflow ~]# setenforce 0  # 临时
setenforce: SELinux is disabled

3.5 close swap

(base) [root@kubuflow ~]# swapoff -a
(base) [root@kubuflow ~]# sed -i 's/.*swap.*/#&/' /etc/fstab

3.6 Forwarding IPv4 and letting iptables see bridged traffic

Verify that the br_netfilter module is loaded by running lsmod | grep br_netfilter. To load this module explicitly, run sudo modprobe br_netfilter. In order for the Linux node's iptables to see bridge traffic correctly, make sure net.bridge.bridge-nf-call-iptables is set to 1 in the sysctl configuration.

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# 设置所需的 sysctl 参数,参数在重新启动后保持不变
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

# 应用 sysctl 参数而不重新启动
sudo sysctl --system

3.7 Time Synchronization

(base) [root@kubuflow ~]# yum install ntpdate -y
(base) [root@kubuflow ~]# ntpdate time.windows.com
24 Dec 14:21:55 ntpdate[18177]: adjust time server 52.231.114.183 offset 0.003717 sec

3.8 install docker

wget https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo -O /etc/yum.repos.d/docker-ce.repo
 yum -y install docker-ce
 systemctl enable docker && systemctl start docker && systemctl status docker

Successful installation

(base) [root@kubuflow ~]# docker --version
Docker version 20.10.22, build 3a2c30b
(base) [root@kubuflow ~]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
(base) [root@kubuflow ~]# 

3.9 docker adds domestic image sources

(base) [root@kubuflow ~]# cat > /etc/docker/daemon.json << EOF
> {
>     "registry-mirrors": [
>         "http://hub-mirror.c.163.com",
>         "https://docker.mirrors.ustc.edu.cn",
>         "https://registry.docker-cn.com"
>     ]
> }
> EOF
(base) [root@kubuflow ~]# # 使配置生效
(base) [root@kubuflow ~]# systemctl daemon-reload
(base) [root@kubuflow ~]# 
(base) [root@kubuflow ~]# # 重启Docker
(base) [root@kubuflow ~]# systemctl restart docker

3.10 Add the yum source of kubernetes

(base) [root@kubuflow ~]# cat > /etc/yum.repos.d/kubernetes.repo << EOF
> [kubernetes]
> name=Kubernetes
> baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
> enabled=1
> gpgcheck=0
> repo_gpgcheck=0
> gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg
> https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
> EOF

3.11 Install kubeadm, kubelet and kubectl

(base) [root@kubuflow ~]# yum -y install kubelet-1.21.5-0 kubeadm-1.21.5-0 kubectl-1.21.5-0 
(base) [root@kubuflow ~]# systemctl enable kubelet

3.12 Deploy Kubernetes Master

(base) [root@kubuflow ~]#  kubeadm init --apiserver-advertise-address=192.168.3.130 --image-repository registry.aliyuncs.com/google_containers  --kubernetes-version v1.21.5  --service-cidr=10.96.0.0/12  --pod-network-cidr=10.244.0.0/16 --ignore-preflight-errors=all

Parameter Description:

  • –apiserver-advertise-address=192.168.3.130
    This parameter is the IP address of the master host. For example, the IP of my Master host is: 192.168.3.130, which is also the 2.4.1IP address we are seeing
  • –image-repository registry.aliyuncs.com/google_containers
    This is the mirror address. Since foreign addresses cannot be accessed, the Aliyun warehouse address used: repository
    registry.aliyuncs.com/google_containers
  • –kubernetes-version=v1.21.5 This parameter is the version number of the downloaded k8s software
  • –service-cidr=10.96.0.0/12 The IP address after this parameter is directly applied to 10.96.0.0/12
    , and it can also be applied to future installations, do not change
  • –pod-network-cidr=10.244.0.0/16
    The IP segment that can be used by the network between pod nodes inside k8s cannot be written the same as service-cidr. If you don’t know how to configure it, use this 10.244.0.0/16 first
  • --ignore-preflight-errors=all adding this will ignore errors

After executing the statement, if you see the following information, the installation is successful.

[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.3.130:6443 --token nupk90.vnoqbfgexf8d2lhp \
	--discovery-token-ca-cert-hash sha256:715fac4463bd6b5b4de53e9356002eed12652fa8c6def12789ccb5d6f73fefaa 
(base) [root@kubuflow ~]# 

3.13 Create kube configuration file

(base) [root@kubuflow ~]# mkdir -p $HOME/.kube
(base) [root@kubuflow ~]# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
(base) [root@kubuflow ~]# sudo chown $(id -u):$(id -g) $HOME/.kube/config

(base) [root@kubuflow ~]# kubectl get nodes
NAME       STATUS     ROLES                  AGE     VERSION
kubuflow   NotReady   control-plane,master   5m45s   v1.21.5

3.14 Install Pod Network Plugin (CNI)

cat > calico.yaml  << EOF
---
# Source: calico/templates/calico-config.yaml
# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Typha is disabled.
  typha_service_name: "none"
  # Configure the backend to use.
  calico_backend: "bird"

  # Configure the MTU to use
  veth_mtu: "1440"

  # The CNI network configuration to install on each node.  The special
  # values in this config will be automatically populated.
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "datastore_type": "kubernetes",
          "nodename": "__KUBERNETES_NODE_NAME__",
          "mtu": __CNI_MTU__,
          "ipam": {
              "type": "calico-ipam"
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        }
      ]
    }

---
# Source: calico/templates/kdd-crds.yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: felixconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: FelixConfiguration
    plural: felixconfigurations
    singular: felixconfiguration
---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamblocks.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMBlock
    plural: ipamblocks
    singular: ipamblock

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: blockaffinities.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BlockAffinity
    plural: blockaffinities
    singular: blockaffinity

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamhandles.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMHandle
    plural: ipamhandles
    singular: ipamhandle

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamconfigs.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMConfig
    plural: ipamconfigs
    singular: ipamconfig

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: bgppeers.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BGPPeer
    plural: bgppeers
    singular: bgppeer

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: bgpconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BGPConfiguration
    plural: bgpconfigurations
    singular: bgpconfiguration

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ippools.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPPool
    plural: ippools
    singular: ippool

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: hostendpoints.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: HostEndpoint
    plural: hostendpoints
    singular: hostendpoint

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: clusterinformations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: ClusterInformation
    plural: clusterinformations
    singular: clusterinformation

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: globalnetworkpolicies.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkPolicy
    plural: globalnetworkpolicies
    singular: globalnetworkpolicy

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: globalnetworksets.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkSet
    plural: globalnetworksets
    singular: globalnetworkset

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networkpolicies.crd.projectcalico.org
spec:
  scope: Namespaced
  group: crd.projectcalico.org
  version: v1
  names:
    kind: NetworkPolicy
    plural: networkpolicies
    singular: networkpolicy

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networksets.crd.projectcalico.org
spec:
  scope: Namespaced
  group: crd.projectcalico.org
  version: v1
  names:
    kind: NetworkSet
    plural: networksets
    singular: networkset
---
# Source: calico/templates/rbac.yaml

# Include a clusterrole for the kube-controllers component,
# and bind it to the calico-kube-controllers serviceaccount.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-kube-controllers
rules:
  # Nodes are watched to monitor for deletions.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - watch
      - list
      - get
  # Pods are queried to check for existence.
  - apiGroups: [""]
    resources:
      - pods
    verbs:
      - get
  # IPAM resources are manipulated when nodes are deleted.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
    verbs:
      - list
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  # Needs access to update clusterinformations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - clusterinformations
    verbs:
      - get
      - create
      - update
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-kube-controllers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-kube-controllers
subjects:
- kind: ServiceAccount
  name: calico-kube-controllers
  namespace: kube-system
---
# Include a clusterrole for the calico-node DaemonSet,
# and bind it to the calico-node serviceaccount.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # The CNI plugin needs to get pods, nodes, and namespaces.
  - apiGroups: [""]
    resources:
      - pods
      - nodes
      - namespaces
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - endpoints
      - services
    verbs:
      # Used to discover service IPs for advertisement.
      - watch
      - list
      # Used to discover Typhas.
      - get
  - apiGroups: [""]
    resources:
      - nodes/status
    verbs:
      # Needed for clearing NodeNetworkUnavailable flag.
      - patch
      # Calico stores some configuration information in node annotations.
      - update
  # Watch for changes to Kubernetes NetworkPolicies.
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
    verbs:
      - watch
      - list
  # Used by Calico for policy information.
  - apiGroups: [""]
    resources:
      - pods
      - namespaces
      - serviceaccounts
    verbs:
      - list
      - watch
  # The CNI plugin patches pods/status.
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - patch
  # Calico monitors various CRDs for config.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - felixconfigurations
      - bgppeers
      - globalbgpconfigs
      - bgpconfigurations
      - ippools
      - ipamblocks
      - globalnetworkpolicies
      - globalnetworksets
      - networkpolicies
      - networksets
      - clusterinformations
      - hostendpoints
      - blockaffinities
    verbs:
      - get
      - list
      - watch
  # Calico must create and update some CRDs on startup.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
      - felixconfigurations
      - clusterinformations
    verbs:
      - create
      - update
  # Calico stores some configuration information on the node.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  # These permissions are only requried for upgrade from v2.6, and can
  # be removed after upgrade or on fresh installations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - bgpconfigurations
      - bgppeers
    verbs:
      - create
      - update
  # These permissions are required for Calico CNI to perform IPAM allocations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ipamconfigs
    verbs:
      - get
  # Block affinities must also be watchable by confd for route aggregation.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
    verbs:
      - watch
  # The Calico IPAM migration needs to get daemonsets. These permissions can be
  # removed if not upgrading from an installation using host-local IPAM.
  - apiGroups: ["apps"]
    resources:
      - daemonsets
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-node
subjects:
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system

---
# Source: calico/templates/calico-node.yaml
# This manifest installs the calico-node container, as well
# as the CNI plugins and network config on
# each master and worker node in a Kubernetes cluster.
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: calico-node
  namespace: kube-system
  labels:
    k8s-app: calico-node
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        k8s-app: calico-node
      annotations:
        # This, along with the CriticalAddonsOnly toleration below,
        # marks the pod as a critical add-on, ensuring it gets
        # priority scheduling and that its resources are reserved
        # if it ever gets evicted.
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      hostNetwork: true
      tolerations:
        # Make sure calico-node gets scheduled on all nodes.
        - effect: NoSchedule
          operator: Exists
        # Mark the pod as a critical add-on for rescheduling.
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoExecute
          operator: Exists
      serviceAccountName: calico-node
      # Minimize downtime during a rolling upgrade or deletion; tell Kubernetes to do a "force
      # deletion": https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.
      terminationGracePeriodSeconds: 0
      priorityClassName: system-node-critical
      initContainers:
        # This container performs upgrade from host-local IPAM to calico-ipam.
        # It can be deleted if this is a fresh installation, or if you have already
        # upgraded to use calico-ipam.
        - name: upgrade-ipam
          image: calico/cni:v3.11.3
          command: ["/opt/cni/bin/calico-ipam", "-upgrade"]
          env:
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
          volumeMounts:
            - mountPath: /var/lib/cni/networks
              name: host-local-net-dir
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
          securityContext:
            privileged: true
        # This container installs the CNI binaries
        # and CNI network config file on each node.
        - name: install-cni
          image: calico/cni:v3.11.3
          command: ["/install-cni.sh"]
          env:
            # Name of the CNI config file to create.
            - name: CNI_CONF_NAME
              value: "10-calico.conflist"
            # The CNI network config to install on each node.
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: cni_network_config
            # Set the hostname based on the k8s node name.
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # CNI MTU Config variable
            - name: CNI_MTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            # Prevents the container from sleeping forever.
            - name: SLEEP
              value: "false"
          volumeMounts:
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
          securityContext:
            privileged: true
        # Adds a Flex Volume Driver that creates a per-pod Unix Domain Socket to allow Dikastes
        # to communicate with Felix over the Policy Sync API.
        - name: flexvol-driver
          image: calico/pod2daemon-flexvol:v3.11.3
          volumeMounts:
          - name: flexvol-driver-host
            mountPath: /host/driver
          securityContext:
            privileged: true
      containers:
        # Runs calico-node container on each Kubernetes node.  This
        # container programs network policy and routes on each
        # host.
        - name: calico-node
          image: calico/node:v3.11.3
          env:
            # Use Kubernetes API as the backing datastore.
            - name: DATASTORE_TYPE
              value: "kubernetes"
            # Wait for the datastore.
            - name: WAIT_FOR_DATASTORE
              value: "true"
            # Set based on the k8s node name.
            - name: NODENAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # Choose the backend to use.
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
            # Cluster type to identify the deployment type
            - name: CLUSTER_TYPE
              value: "k8s,bgp"
            # Auto-detect the BGP IP address.
            - name: IP
              value: "autodetect"
            # Enable IPIP
            - name: CALICO_IPV4POOL_IPIP
              value: "Always"
            # Set MTU for tunnel device used if ipip is enabled
            - name: FELIX_IPINIPMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            # The default IPv4 pool to create on startup if none exists. Pod IPs will be
            # chosen from this range. Changing this value after installation will have
            # no effect. This should fall within `--cluster-cidr`.
            - name: CALICO_IPV4POOL_CIDR
              value: "10.244.0.0/16"
            # Disable file logging so `kubectl logs` works.
            - name: CALICO_DISABLE_FILE_LOGGING
              value: "true"
            # Set Felix endpoint to host default action to ACCEPT.
            - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
              value: "ACCEPT"
            # Disable IPv6 on Kubernetes.
            - name: FELIX_IPV6SUPPORT
              value: "false"
            # Set Felix logging to "info"
            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"
            - name: FELIX_HEALTHENABLED
              value: "true"
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: 250m
          livenessProbe:
            exec:
              command:
              - /bin/calico-node
              - -felix-live
              - -bird-live
            periodSeconds: 10
            initialDelaySeconds: 10
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
              - /bin/calico-node
              - -felix-ready
              - -bird-ready
            periodSeconds: 10
          volumeMounts:
            - mountPath: /lib/modules
              name: lib-modules
              readOnly: true
            - mountPath: /run/xtables.lock
              name: xtables-lock
              readOnly: false
            - mountPath: /var/run/calico
              name: var-run-calico
              readOnly: false
            - mountPath: /var/lib/calico
              name: var-lib-calico
              readOnly: false
            - name: policysync
              mountPath: /var/run/nodeagent
      volumes:
        # Used by calico-node.
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: var-run-calico
          hostPath:
            path: /var/run/calico
        - name: var-lib-calico
          hostPath:
            path: /var/lib/calico
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
        # Used to install CNI.
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
        # Mount in the directory for host-local IPAM allocations. This is
        # used when upgrading from host-local to calico-ipam, and can be removed
        # if not using the upgrade-ipam init container.
        - name: host-local-net-dir
          hostPath:
            path: /var/lib/cni/networks
        # Used to create per-pod Unix Domain Sockets
        - name: policysync
          hostPath:
            type: DirectoryOrCreate
            path: /var/run/nodeagent
        # Used to install Flex Volume Driver
        - name: flexvol-driver-host
          hostPath:
            type: DirectoryOrCreate
            path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-node
  namespace: kube-system

---
# Source: calico/templates/calico-kube-controllers.yaml

# See https://github.com/projectcalico/kube-controllers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: calico-kube-controllers
  namespace: kube-system
  labels:
    k8s-app: calico-kube-controllers
spec:
  # The controllers can only have a single active instance.
  replicas: 1
  selector:
    matchLabels:
      k8s-app: calico-kube-controllers
  strategy:
    type: Recreate
  template:
    metadata:
      name: calico-kube-controllers
      namespace: kube-system
      labels:
        k8s-app: calico-kube-controllers
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      tolerations:
        # Mark the pod as a critical add-on for rescheduling.
        - key: CriticalAddonsOnly
          operator: Exists
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      serviceAccountName: calico-kube-controllers
      priorityClassName: system-cluster-critical
      containers:
        - name: calico-kube-controllers
          image: calico/kube-controllers:v3.11.3
          env:
            # Choose which controllers to run.
            - name: ENABLED_CONTROLLERS
              value: node
            - name: DATASTORE_TYPE
              value: kubernetes
          readinessProbe:
            exec:
              command:
              - /usr/bin/check-status
              - -r
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-kube-controllers
  namespace: kube-system
---
# Source: calico/templates/calico-etcd-secrets.yaml
---
# Source: calico/templates/calico-typha.yaml
---
# Source: calico/templates/configure-canal.yaml
EOF


(base) [root@kubuflow ~]# kubectl apply -f calico.yaml 
configmap/calico-config created
Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers create

3.15 Verifying the network

(base) [root@kubuflow ~]# kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
kubuflow   Ready    control-plane,master   13m   v1.21.5
(base) [root@kubuflow ~]#  kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5bcd7db644-ncdh5   1/1     Running   0          114s
calico-node-9qjv8                          1/1     Running   0          114s
coredns-59d64cd4d4-574b4                   1/1     Running   0          13m
coredns-59d64cd4d4-5mr9x                   1/1     Running   0          13m
etcd-kubuflow                              1/1     Running   0          13m
kube-apiserver-kubuflow                    1/1     Running   0          13m
kube-controller-manager-kubuflow           1/1     Running   0          13m
kube-proxy-xcfcd                           1/1     Running   0          13m
kube-scheduler-kubuflow                    1/1     Running   0          13m

3.16 Remove taint

After the single-episode version of k8s is installed, the service cannot be deployed.
Because the default master cannot deploy pods and has stains, you need to remove the stains or add a new node, here is to remove the stains.

# After execution, you can see that there is an output indicating that there is a stain

(base) [root@kubuflow ~]# kubectl get node -o yaml | grep taint -A 5
    taints:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
  status:
    addresses:
    - address: 192.168.3.130

Unstain

(base) [root@kubuflow ~]# kubectl taint nodes --all node-role.kubernetes.io/master-
node/kubuflow untainted

3.17. Install the package of completion command

(base) [root@kubuflow ~]# yum -y install bash-completion  #安装补全命令的包
(base) [root@kubuflow ~]# kubectl completion bash
(base) [root@kubuflow ~]# source /usr/share/bash-completion/bash_completion
(base) [root@kubuflow ~]# kubectl completion bash >/etc/profile.d/kubectl.sh
(base) [root@kubuflow ~]# source /etc/profile.d/kubectl.sh
(base) [root@kubuflow ~]# cat  >>  /root/.bashrc <<EOF
source /etc/profile.d/kubectl.sh
EOF

3.18 Deploy and access the Kubernetes dashboard (Dashboard)

Dashboard is not deployed by default. It can be deployed with the following command:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.6.1/aio/deploy/recommended.yaml

Check if it is running

(base) [root@kubuflow ~]# kubectl get pod -n kubernetes-dashboard
NAME                                         READY   STATUS    RESTARTS   AGE
dashboard-metrics-scraper-7c857855d9-snpfs   1/1     Running   0          16m
kubernetes-dashboard-6b79449649-4kgsx        1/1     Running   0          16m

Change the ClusterIP type to NodePort, use: Access Service from outside the cluster

(base) [root@kubuflow ~]# kubectl edit svc kubernetes-dashboard -n kubernetes-dashboard


Change type: ClusterIP to type: NodePort, and use the kubectl get svc -n kubernetes-dashboard command to view the automatically generated ports after saving:


(base) [root@kubuflow ~]# kubectl get svc -n kubernetes-dashboard
NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
dashboard-metrics-scraper   ClusterIP   10.98.238.142    <none>        8000/TCP        25m
kubernetes-dashboard        NodePort    10.105.207.158   <none>        443:30988/TCP   25m

As shown above, Dashboard has been exposed on port 30988/ and can now be accessed externally using https://:30988/.

create access account

cat >  dash.yaml << EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard
EOF

(base) [root@kubuflow ~]#  kubectl apply -f dash.yaml
serviceaccount/admin-user created
clusterrolebinding.rbac.authorization.k8s.io/admin-user created

View token token

kubectl -n kubernetes-dashboard get secret $(kubectl -n kubernetes-dashboard get sa/admin-user -o jsonpath="{.secrets[0].name}") -o go-template="{
   
   {.data.token | base64decode}}"

eyJhbGciOiJSUzI1Nxxx.....xxxxxxxxx..........pTDfnNmg

Since my host has done remote mapping, the access address here looks different from the host ip. It
should actually be https://192.168.3.130:30988

4 Kubeflow installation

4.1 Download the official installation script repository

Install version 1.6.0

(base) [root@kubuflow softwares]# wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.6.0.zip

(base) [root@kubuflow ~]# unzip v1.6.0.zip
(base) [root@kubuflow ~]# unzip v1.6.0.zip mv manifests-1.6.0/ manifests

4.2 Download and install kustomize

https://github.com/kubernetes-sigs/kustomize

curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash

If the download is slow, you can use a proxy to speed up github

(base) [root@kubuflow softwares]# curl -s "https://ghproxy.com/https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash

add to bin

 cp kustomize /bin/
 kustomize version

4.3 How to synchronize the image to dockerhub

Since the images of some components of kubeflow are foreign, it is necessary to solve the problem of pulling foreign Google images. For details, please refer to a post shared by a big guy:

The latest installation method of kubeflow domestic environment https://zhuanlan.zhihu.com/p/546677250

### 获取gcr镜像,因为我的网络只无法获取gcr.io, quay.io正常,可以根据需求修改
kustomize build example |grep 'image: gcr.io'|awk '$2 != "" { print $2}' |sort -u 
###   使用github-ci同步至个人dockerhub仓库
https://github.com/kenwoodjw/sync_gcr
修改https://github.com/kenwoodjw/sync_gcr/blob/master/images.txt 提交会触发ci同步镜像至dockerhub
可根据需求修改https://github.com/kenwoodjw/sync_gcr/blob/master/sync_image.py

4.4 Prepare sc, pv, pvc

The components of kubeflow need to be stored, so pv needs to be prepared in advance. The local disk storage method used in this experiment is stored. The process is as follows:
这里需要小心,名字和路径需要写对,按照下面步骤进行,或者根据自己创建的路径仔细修改

  1. Prepare local directory
mkdir -p /data/k8s/istio-authservice /data/k8s/katib-mysql /data/k8s/minio /data/k8s/mysql-pv-claim

Modify auth path permissions

sudo chmod -R 777 /data/k8s/istio-authservice/
  1. Write kubeflow-storage.yaml
    hostPath: path: "/data/k8s/istio-authservice"and change it to the directory created above
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: authservice
  namespace: istio-system
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/istio-authservice"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  namespace: kubeflow
  name: katib-mysql
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/katib-mysql"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/minio"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-claim
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/mysql-pv-claim"

implement

kubectl apply -f kubeflow-storage.yaml

4.5 Modify the installation script to pull the image

(base) [root@kubuflow example]# cat kustomization.yaml

Modify the contents of the manifests/example/kustomization.yaml file as follows, that is, add images later, which is equivalent to synchronizing the images of Google (gcr.io, quay.io) to dockerhub:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
# Cert-Manager
- ../common/cert-manager/cert-manager/base
- ../common/cert-manager/kubeflow-issuer/base
# Istio
- ../common/istio-1-16/istio-crds/base
- ../common/istio-1-16/istio-namespace/base
- ../common/istio-1-16/istio-install/base
# OIDC Authservice
- ../common/oidc-authservice/base
# Dex
- ../common/dex/overlays/istio
# KNative
- ../common/knative/knative-serving/overlays/gateways
- ../common/knative/knative-eventing/base
- ../common/istio-1-16/cluster-local-gateway/base
# Kubeflow namespace
- ../common/kubeflow-namespace/base
# Kubeflow Roles
- ../common/kubeflow-roles/base
# Kubeflow Istio Resources
- ../common/istio-1-16/kubeflow-istio-resources/base


# Kubeflow Pipelines
- ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
# Katib
- ../apps/katib/upstream/installs/katib-with-kubeflow
# Central Dashboard
- ../apps/centraldashboard/upstream/overlays/kserve
# Admission Webhook
- ../apps/admission-webhook/upstream/overlays/cert-manager
# Jupyter Web App
- ../apps/jupyter/jupyter-web-app/upstream/overlays/istio
# Notebook Controller
- ../apps/jupyter/notebook-controller/upstream/overlays/kubeflow
# Profiles + KFAM
# - ../apps/profiles/upstream/overlays/kubeflow
# Volumes Web App
- ../apps/volumes-web-app/upstream/overlays/istio
# Tensorboards Controller
-  ../apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow
# Tensorboard Web App
-  ../apps/tensorboard/tensorboards-web-app/upstream/overlays/istio
# Training Operator
- ../apps/training-operator/upstream/overlays/kubeflow
# User namespace
- ../common/user-namespace/base

# KServe
- ../contrib/kserve/kserve
- ../contrib/kserve/models-web-app/overlays/kubeflow


images:
 - name: gcr.io/arrikto/istio/pilot:1.14.1-1-g19df463bb
   newName: kenwood/pilot
   newTag: "1.14.1-1-g19df463bb"
 - name: gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
   newName: kenwood/oidc-authservice
   newTag: "28c59ef"
 - name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:dc0ac2d8f235edb04ec1290721f389d2bc719ab8b6222ee86f17af8d7d2a160f
   newName: kenwood/controller
   newTag: "dc0ac2"
 - name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:632d9d710d070efed2563f6125a87993e825e8e36562ec3da0366e2a897406c0
   newName: kenwood/cmd/mtping
   newTag: "632d9d"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:847bb97e38440c71cb4bcc3e430743e18b328ad1e168b6fca35b10353b9a2c22
   newName: kenwood/domain-mapping-webhook
   newTag: "847bb9"
 - name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:b7faf7d253bd256dbe08f1cac084469128989cf39abbe256ecb4e1d4eb085a31
   newName: kenwood/webhook
   newTag: "b7faf7"
 - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:f253b82941c2220181cee80d7488fe1cefce9d49ab30bdb54bcb8c76515f7a26
   newName: kenwood/controller
   newTag: "f253b8"
 - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:a705c1ea8e9e556f860314fe055082fbe3cde6a924c29291955f98d979f8185e
   newName: kenwood/webhook
   newTag: "a705c1"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:93ff6e69357785ff97806945b284cbd1d37e50402b876a320645be8877c0d7b7
   newName: kenwood/activator
   newTag: "93ff6e"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:007820fdb75b60e6fd5a25e65fd6ad9744082a6bf195d72795561c91b425d016
   newName: kenwood/autoscaler
   newTag: "007820"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:75cfdcfa050af9522e798e820ba5483b9093de1ce520207a3fedf112d73a4686
   newName: kenwood/controller
   newTag: "75cfdc"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:847bb97e38440c71cb4bcc3e430743e18b328ad1e168b6fca35b10353b9a2c22
   newName: kenwood/domain-mapping-webhook
   newTag: "847bb9"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:23baa19322320f25a462568eded1276601ef67194883db9211e1ea24f21a0beb
   newName: kenwood/domain-mapping
   newTag: "23baa1"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:14415b204ea8d0567235143a6c3377f49cbd35f18dc84dfa4baa7695c2a9b53d
   newName: kenwood/queue
   newTag: "14415b"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:9084ea8498eae3c6c4364a397d66516a25e48488f4a9871ef765fa554ba483f0
   newName: kenwood/webhook
   newTag: "9084ea"
 - name: gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.3
   newName: kenwood/visualization-server
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/ml-pipeline/cache-server:2.0.0-alpha.3
   newName: kenwood/cache-server
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.3
   newName: kenwood/metadata-envoy
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.3
   newName: kenwood/viewer-crd-controller
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
   newName: kenwood/oidc-authservice
   newTag: "28c59ef"

Modify yaml, add in each file belowstorageClassName: local-storage

apps/katib/upstream/components/mysql/pvc.yaml
apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml
apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml
common/oidc-authservice/base/pvc.yaml

insert image description here

4.6 One-click installation

https://github.com/kubeflow/manifests#install-with-a-single-command

(base) [root@kubuflow manifests]# pwd
/root/softwares/manifests
(base) [root@kubuflow manifests]# while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
2022/12/24 16:23:51 well-defined vars that were never replaced: kfp-app-name,kfp-app-version

After most of the pods are created, the output is as follows:

The last error error: resource mapping not found for name: “kubeflow-user-example-com” namespace: “” from “STDIN”: no matches for kind “Profile” in version “kubeflow.org/v1beta1”, we can Ignore it first, this seems to be an official kubeflow example, you can also refer to the step-by-step installation steps for details:
https://github.com/kubeflow/manifests#user-namespace
kustomize build common/user-namespace/base | kubectl apply - f-

After a while (you can play a game, wait patiently, each pod image and container creation will be pulled in the middle, so it is relatively slow), we can check the status of the pods, all of them are running, indicating a green light all the way, and you can visit kubeflow dashbord up

(base) [root@kubuflow ~]# kubectl get pods --all-namespaces 


We check the dashboard of k8s, and we can see that all pods are running normally
insert image description here

4.7 Access Kubeflow Dashboard

kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80

--address 0.0.0.0It means that it can be accessed by an external host. If not added, it can only be accessed locally.
insert image description here
Default username and password:

[email protected] 
 12341234

Only http access, https has a problem

5 References

Guess you like

Origin blog.csdn.net/yanqianglifei/article/details/128432784