Build a Kubeflow machine learning platform from scratch

1 Introduction to Kubeflow

1.1 What is Kubeflow?

From the official website: the Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. Its goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

As the official introduction shows, Kubeflow and Kubernetes are inseparable. Broadly, Kubeflow is an open-source machine learning workflow platform built on Kubernetes and originally open-sourced by Google. It integrates many machine learning tools, such as JupyterLab for interactive experiments, Katib for hyperparameter tuning, and Argo Workflows for pipeline orchestration. As a large toolbox, Kubeflow gives machine learning developers many optional tools and offers practical tooling for taking machine learning projects into production.

1.2 Background of Kubeflow

Kubernetes was originally a container platform for managing stateless applications, but in the last couple of years more and more companies have used it to run all kinds of workloads, machine learning training in particular. AI companies and the AI departments of internet companies have tried to run distributed training jobs for TensorFlow, Caffe, MXNet, and other frameworks on Kubernetes, which poses new challenges for Kubernetes.

First, a distributed machine learning job usually involves two kinds of processes: parameter servers (PS) and worker nodes (workers). Jobs in different domains also place different requirements on the PS and workers, which makes them hard to configure on Kubernetes. Taking TensorFlow as an example, a distributed TensorFlow training job typically starts several PS and several workers, and in the practice recommended by TensorFlow each worker and PS must be passed different command-line arguments.

Second, the default Kubernetes scheduler is not well suited to scheduling machine learning jobs. While the first problem only causes friction when writing and deploying the application, the low resource utilization and slower training caused by poor scheduling deserve more attention. Machine learning jobs have relatively high compute and network demands. Generally all workers train on GPUs, and placing the PS and workers of the same job on the same machine, or on nearby machines with a good network between them, reduces training time.

In response to these problems the Kubeflow project was born. It took TensorFlow as its first supported framework and defined a new resource type on Kubernetes: TFJob, short for TensorFlow Job. With this resource type, engineers who train models with TensorFlow no longer need to write complicated configuration. They only need to decide, based on their understanding of the workload, how many PS and workers to use and where data and logs are read and written, and that is enough to define a training job.
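
To make the TFJob idea concrete, here is a minimal shell sketch that writes a TFJob manifest for a chosen number of PS and workers. It assumes the kubeflow.org/v1 TFJob API served by the Kubeflow training operator; the write_tfjob helper, the job name, and the image are hypothetical placeholders and do not come from the original article.

```shell
#!/bin/sh
# Write a minimal TFJob manifest with the given number of PS and workers.
# Assumption: the kubeflow.org/v1 TFJob API; job name and image are placeholders.
write_tfjob() {
    ps_count=$1
    worker_count=$2
    out_file=$3
    cat > "$out_file" <<EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-tfjob                 # hypothetical job name
spec:
  tfReplicaSpecs:
    PS:
      replicas: ${ps_count}
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow        # container name the TFJob controller looks for
              image: registry.example.com/my-training:latest   # hypothetical image
    Worker:
      replicas: ${worker_count}
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/my-training:latest   # hypothetical image
EOF
}

# Example: 2 parameter servers and 4 workers, written to a temporary file.
tfjob_file=$(mktemp)
write_tfjob 2 4 "$tfjob_file"
cat "$tfjob_file"
```

On a cluster where Kubeflow is already installed, a manifest like this would typically be submitted with kubectl apply -f.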

In one sentence: Kubeflow is a composable, portable, and scalable machine learning technology stack built for Kubernetes.

The background above is taken from the article "kubeflow-Introduction" (https://www.jianshu.com/p/192f22a0b857), which explains the origins and current state of Kubeflow well and is worth reading for a deeper understanding.

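1.3 Kubeflow and machine learning

Kubeflow is a platform for data scientists who want to build and run ML tasks. It is also for ML engineers and operations teams who want to deploy ML systems to various environments for development, testing, and production-level serving.

Kubeflow is the ML toolkit for Kubernetes.

The figure below shows Kubeflow as a platform for arranging the components of an ML system on top of Kubernetes:

Kubeflow is a glue project: it combines many kinds of machine learning support, such as model training, hyperparameter tuning, and model deployment, and deploys them as containers, so that each system in the workflow is highly available and easy to scale. Users who have deployed Kubeflow can then use it for different machine learning tasks.

The figure below shows the machine learning workflow in sequence. The arrow at the end of the workflow points back into the flow to indicate that machine learning is an iterative process: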

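In the experimental phase, you develop a model based on initial assumptions, then test and update it iteratively until it produces the results you are looking for:

  • Identify the problem you want the ML system to solve.
  • Collect and analyze the data needed to train the ML model.
  • Choose an ML framework and algorithm, and code the initial version of the model.
  • Experiment with the data and train the model.
  • Tune the model hyperparameters to ensure the most efficient processing and the most accurate results.

In the production phase, you deploy a system that performs the following processes:

  • Transform the data into the format the training system needs. To ensure that the model behaves consistently during training and prediction, the transformation must be the same in the experimental and production phases.
  • Train the ML model.
  • Serve the model for online prediction or for running in batch mode.
  • Monitor the model's performance, and feed the results back into your processes for tuning or retraining the model.

The figure below shows the Kubeflow components in the ML workflow.

1.4 Core components

The core components that make up Kubeflow are described in detail on the official site at https://www.kubeflow.org/docs/components/. Below is a mind map I drew: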

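2 Kubeflow installation guide

2.1 Useful links

  • Official manifests repository (customized installation guide): https://github.com/kubeflow/manifests
  • Official Kubeflow GitHub organization: https://github.com/kubeflow/
  • Kubernetes official website: https://kubernetes.io/zh-cn/
  • GitHub proxy for faster downloads: https://ghproxy.com/

2.2 Installation environment

The machine used for this installation:

  • OS version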
cat /etc/redhat-release

CentOS Linux release 7.9.2009 (Core)
  • Memory
free -h
              total        used        free      shared  buff/cache   available
Mem:           110G        3.4G        105G        3.8M        891M        105G
Swap:          4.0G          0B        4.0G
  • CPU
cat /proc/cpuinfo | grep name | sort | uniq
model name	: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
42

  • GPU
nvidia-smi
Sat Dec 24 13:01:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   38C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2.3 Prerequisites

Installing Kubeflow requires the following tools:

  • Kubernetes: 1.21 at most
  • kustomize: 3.2.0
  • kubectl

https://github.com/kubeflow/manifests#prerequisites
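
The Kubernetes version ceiling above can be checked before installing anything. The following is a small sketch, not part of the official tooling: k8s_version_supported is a hypothetical helper that accepts a version string such as v1.21.5 and succeeds only for 1.x releases up to 1.21.

```shell
#!/bin/sh
# Return 0 when a Kubernetes version string (for example "v1.21.5")
# is a 1.x release no newer than 1.21; return 1 otherwise.
k8s_version_supported() {
    ver=${1#v}            # strip an optional leading "v"
    major=${ver%%.*}      # text before the first dot
    rest=${ver#*.}        # text after the first dot
    minor=${rest%%.*}     # text before the next dot
    [ "$major" -eq 1 ] && [ "$minor" -le 21 ]
}

# Example with the version installed later in this guide.
if k8s_version_supported v1.21.5; then
    echo "v1.21.5 is within the supported range"
fi
```

The same function can be fed the version reported by whichever kubelet or kubectl package you plan to install.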

3 Installing Kubernetes

A Kubernetes cluster consists of master nodes and worker nodes. Here we install Kubernetes on a single machine.

3.1 Check the IP address

(base) [root@server-szry1agd ~]# ip add

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:44:6c:3c brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.130/22 brd 192.168.3.255 scope global noprefixroute dynamic eth0
       valid_lft 80254sec preferred_lft 80254sec
    inet6 fe80::f816:3eff:fe44:6c3c/64 scope link 
       valid_lft forever preferred_lft forever

3.2 Change the hostname

This step is optional. Some articles I have read say that the hostname must not contain underscores.

(base) [root@server-szry1agd ~]# hostnamectl set-hostname kubuflow && bash

After the change, the shell prompt shows the new hostname (kubuflow instead of server-szry1agd).
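
The remark about underscores matches the naming rule Kubernetes applies to node names, which must be valid RFC 1123 DNS subdomains: lowercase letters, digits, "-" and ".", at most 253 characters. A minimal sketch of that check, using a hypothetical helper named is_valid_node_name:

```shell
#!/bin/sh
# Return 0 when the argument is a valid RFC 1123 DNS subdomain:
# dot-separated labels of lowercase alphanumerics and "-", each label
# starting and ending with an alphanumeric, at most 253 characters overall.
is_valid_node_name() {
    name=$1
    [ -n "$name" ] && [ "${#name}" -le 253 ] || return 1
    printf '%s\n' "$name" |
        grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
}

# Example: the hostname chosen in this guide versus one with an underscore.
for candidate in kubuflow kube_flow; do
    if is_valid_node_name "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: not a valid node name"
    fi
done
```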

3.3 Add a hosts entry

Replace the IP address and hostname with your own.

(base) [root@kubuflow ~]# cat >> /etc/hosts << EOF 
> 192.168.3.130  kubuflow 
> EOF

Check /etc/hosts:

(base) [root@kubuflow ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
0.0.0.0     server-szry1agd.novalocal
192.168.3.130  kubuflow 

3.4 Disable the firewall and SELinux

(base) [root@kubuflow ~]# systemctl stop firewalld
(base) [root@kubuflow ~]# systemctl disable firewalld
(base) [root@kubuflow ~]# sed -i 's/enforcing/disabled/' /etc/selinux/config # permanent
(base) [root@kubuflow ~]# setenforce 0  # temporary
setenforce: SELinux is disabled

3.5 Disable swap

(base) [root@kubuflow ~]# swapoff -a
(base) [root@kubuflow ~]# sed -i 's/.*swap.*/#&/' /etc/fstab
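
The sed expression above comments out every line in /etc/fstab that mentions swap, and running it a second time would prepend another "#". The sketch below tries a slightly more cautious variant on a throwaway copy: it skips lines that are already comments, so it can be repeated safely. The sample fstab contents and the comment_out_swap helper are illustrative assumptions.

```shell
#!/bin/sh
# Print the given fstab-style file with swap entries commented out.
# Lines that already start with "#" are left untouched, so the result
# is the same no matter how many times the filter is applied.
comment_out_swap() {
    sed '/^[[:space:]]*#/!s/.*swap.*/#&/' "$1"
}

# Try it on a temporary sample file instead of the real /etc/fstab.
sample_fstab=$(mktemp)
cat > "$sample_fstab" <<'EOF'
/dev/mapper/centos-root /     xfs   defaults 0 0
/dev/mapper/centos-swap swap  swap  defaults 0 0
EOF

comment_out_swap "$sample_fstab"
```

After disabling swap, free -h (used earlier in this guide) should report a Swap total of 0B.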

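3.6 Forward IPv4 and let iptables see bridged traffic

Run lsmod | grep br_netfilter to verify that the br_netfilter module is loaded. To load it explicitly, run sudo modprobe br_netfilter. For iptables on a Linux node to see bridged traffic correctly, make sure net.bridge.bridge-nf-call-iptables is set to 1 in your sysctl configuration.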

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# Set the required sysctl parameters; they persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

# Apply the sysctl parameters without rebooting
sudo sysctl --system
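
As a quick sanity check on the configuration file written above, the sketch below parses a sysctl-style file and confirms that the three keys are set to 1. The helper name check_k8s_sysctl_conf is hypothetical, and the demo runs against a temporary copy rather than /etc/sysctl.d/k8s.conf.

```shell
#!/bin/sh
# Return 0 when the given sysctl-style file sets all three keys to 1.
check_k8s_sysctl_conf() {
    conf=$1
    for key in net.bridge.bridge-nf-call-iptables \
               net.bridge.bridge-nf-call-ip6tables \
               net.ipv4.ip_forward; do
        # Split each line on "=", strip blanks, keep the last value seen for the key.
        value=$(awk -F= -v k="$key" '
            { gsub(/[ \t]/, "", $1) }
            $1 == k { gsub(/[ \t]/, "", $2); v = $2 }
            END { print v }' "$conf")
        if [ "$value" != "1" ]; then
            echo "missing or not 1: $key"
            return 1
        fi
    done
}

# Demo on a temporary file with the same contents as in this section.
sample_conf=$(mktemp)
cat > "$sample_conf" <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
check_k8s_sysctl_conf "$sample_conf" && echo "sysctl settings look complete"
```

To inspect the live kernel values instead of the file, run sysctl followed by the three key names.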

3.7 Time synchronization

(base) [root@kubuflow ~]# yum install ntpdate -y
(base) [root@kubuflow ~]# ntpdate time.windows.com
24 Dec 14:21:55 ntpdate[18177]: adjust time server 52.231.114.183 offset 0.003717 sec

3.8 Install Docker

wget https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo -O /etc/yum.repos.d/docker-ce.repo
yum -y install docker-ce
systemctl enable docker && systemctl start docker && systemctl status docker

Installation succeeded:

(base) [root@kubuflow ~]# docker --version
Docker version 20.10.22, build 3a2c30b
(base) [root@kubuflow ~]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
(base) [root@kubuflow ~]# 

3.9 Add registry mirrors in China for Docker

(base) [root@kubuflow ~]# cat > /etc/docker/daemon.json << EOF
> {
>     "registry-mirrors": [
>         "http://hub-mirror.c.163.com",
>         "https://docker.mirrors.ustc.edu.cn",
>         "https://registry.docker-cn.com"
>     ]
> }
> EOF
(base) [root@kubuflow ~]# # Apply the configuration
(base) [root@kubuflow ~]# systemctl daemon-reload
(base) [root@kubuflow ~]# 
(base) [root@kubuflow ~]# # Restart Docker
(base) [root@kubuflow ~]# systemctl restart docker

3.10 Add the Kubernetes yum repository

(base) [root@kubuflow ~]# cat > /etc/yum.repos.d/kubernetes.repo << EOF
> [kubernetes]
> name=Kubernetes
> baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
> enabled=1
> gpgcheck=0
> repo_gpgcheck=0
> gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg
>        https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
> EOF

3.11 Install kubeadm, kubelet, and kubectl

(base) [root@kubuflow ~]# yum -y install kubelet-1.21.5-0 kubeadm-1.21.5-0 kubectl-1.21.5-0 
(base) [root@kubuflow ~]# systemctl enable kubelet

3.12 Deploy the Kubernetes master

(base) [root@kubuflow ~]#  kubeadm init --apiserver-advertise-address=192.168.3.130 --image-repository registry.aliyuncs.com/google_containers  --kubernetes-version v1.21.5  --service-cidr=10.96.0.0/12  --pod-network-cidr=10.244.0.0/16 --ignore-preflight-errors=all

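Parameter notes:

  • --apiserver-advertise-address=192.168.3.130
    The IP address of the master host. Mine is 192.168.3.130, the address shown in section 3.1.
  • --image-repository registry.aliyuncs.com/google_containers
    The image registry. Registries hosted outside China are not reachable here, so the Alibaba Cloud mirror registry.aliyuncs.com/google_containers is used instead.
  • --kubernetes-version=v1.21.5
    The Kubernetes version to install.
  • --service-cidr=10.96.0.0/12
    The IP range for Services. 10.96.0.0/12 can be used as-is, here and in later installations.
  • --pod-network-cidr=10.244.0.0/16
    The IP range available to Pods inside the cluster. It must not be the same as the service CIDR; if unsure, start with 10.244.0.0/16.
  • --ignore-preflight-errors=all
    Downgrades every failed preflight check to a warning so that initialization continues.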
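
Because the Pod range and the Service range must not collide, it can help to check candidate values before running kubeadm init. The following sketch tests two IPv4 CIDR blocks for overlap by comparing their network bits under the shorter prefix. Both helper names, ip_to_int and cidrs_overlap, are hypothetical.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    ip=$1
    o1=${ip%%.*}; rest=${ip#*.}
    o2=${rest%%.*}; rest=${rest#*.}
    o3=${rest%%.*}; o4=${rest#*.}
    echo $(( (o1 << 24) + (o2 << 16) + (o3 << 8) + o4 ))
}

# Return 0 when two CIDR blocks (for example 10.96.0.0/12) overlap.
cidrs_overlap() {
    a_ip=${1%/*}; a_len=${1#*/}
    b_ip=${2%/*}; b_len=${2#*/}
    # Two blocks overlap exactly when they agree on the bits
    # covered by the shorter (less specific) prefix.
    len=$a_len
    if [ "$b_len" -lt "$len" ]; then len=$b_len; fi
    mask=$(( (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF ))
    a_net=$(( $(ip_to_int "$a_ip") & mask ))
    b_net=$(( $(ip_to_int "$b_ip") & mask ))
    [ "$a_net" -eq "$b_net" ]
}

# Example with the values used in this guide.
if cidrs_overlap 10.96.0.0/12 10.244.0.0/16; then
    echo "service and pod CIDRs overlap"
else
    echo "service and pod CIDRs are disjoint"
fi
```

The same function can be used to compare either range against the host network (192.168.3.130/22 on this machine).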

If the command prints output like the following, the installation succeeded.

[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.3.130:6443 --token nupk90.vnoqbfgexf8d2lhp \
	--discovery-token-ca-cert-hash sha256:715fac4463bd6b5b4de53e9356002eed12652fa8c6def12789ccb5d6f73fefaa 
(base) [root@kubuflow ~]# 

3.13 Create the kubeconfig file

(base) [root@kubuflow ~]# mkdir -p $HOME/.kube
(base) [root@kubuflow ~]# sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
(base) [root@kubuflow ~]# sudo chown $(id -u):$(id -g) $HOME/.kube/config

(base) [root@kubuflow ~]# kubectl get nodes
NAME       STATUS     ROLES                  AGE     VERSION
kubuflow   NotReady   control-plane,master   5m45s   v1.21.5
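
The node is expected to stay NotReady at this point because no Pod network add-on has been installed yet. As a small sketch, the filter below lists nodes whose STATUS column is anything other than exactly Ready, reading the default tabular output of kubectl get nodes on standard input. The function name not_ready_nodes is hypothetical.

```shell
#!/bin/sh
# Read `kubectl get nodes` output on stdin and print the names of
# nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Demo with the output captured in this section.
not_ready_nodes <<'EOF'
NAME       STATUS     ROLES                  AGE     VERSION
kubuflow   NotReady   control-plane,master   5m45s   v1.21.5
EOF
```

On a live cluster the equivalent call is kubectl get nodes | not_ready_nodes; an empty result means every node reports Ready.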

3.14 Install the Pod network add-on (CNI)

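# Quote the heredoc delimiter so the shell does not expand the backticks in the manifest comments
cat > calico.yaml << 'EOF'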
---
# Source: calico/templates/calico-config.yaml
# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Typha is disabled.
  typha_service_name: "none"
  # Configure the backend to use.
  calico_backend: "bird"

  # Configure the MTU to use
  veth_mtu: "1440"

  # The CNI network configuration to install on each node.  The special
  # values in this config will be automatically populated.
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "datastore_type": "kubernetes",
          "nodename": "__KUBERNETES_NODE_NAME__",
          "mtu": __CNI_MTU__,
          "ipam": {
              "type": "calico-ipam"
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        }
      ]
    }

---
# Source: calico/templates/kdd-crds.yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: felixconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: FelixConfiguration
    plural: felixconfigurations
    singular: felixconfiguration
---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamblocks.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMBlock
    plural: ipamblocks
    singular: ipamblock

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: blockaffinities.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BlockAffinity
    plural: blockaffinities
    singular: blockaffinity

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamhandles.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMHandle
    plural: ipamhandles
    singular: ipamhandle

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamconfigs.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMConfig
    plural: ipamconfigs
    singular: ipamconfig

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: bgppeers.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BGPPeer
    plural: bgppeers
    singular: bgppeer

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: bgpconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BGPConfiguration
    plural: bgpconfigurations
    singular: bgpconfiguration

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ippools.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPPool
    plural: ippools
    singular: ippool

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: hostendpoints.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: HostEndpoint
    plural: hostendpoints
    singular: hostendpoint

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: clusterinformations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: ClusterInformation
    plural: clusterinformations
    singular: clusterinformation

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: globalnetworkpolicies.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkPolicy
    plural: globalnetworkpolicies
    singular: globalnetworkpolicy

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: globalnetworksets.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkSet
    plural: globalnetworksets
    singular: globalnetworkset

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networkpolicies.crd.projectcalico.org
spec:
  scope: Namespaced
  group: crd.projectcalico.org
  version: v1
  names:
    kind: NetworkPolicy
    plural: networkpolicies
    singular: networkpolicy

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networksets.crd.projectcalico.org
spec:
  scope: Namespaced
  group: crd.projectcalico.org
  version: v1
  names:
    kind: NetworkSet
    plural: networksets
    singular: networkset
---
# Source: calico/templates/rbac.yaml

# Include a clusterrole for the kube-controllers component,
# and bind it to the calico-kube-controllers serviceaccount.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-kube-controllers
rules:
  # Nodes are watched to monitor for deletions.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - watch
      - list
      - get
  # Pods are queried to check for existence.
  - apiGroups: [""]
    resources:
      - pods
    verbs:
      - get
  # IPAM resources are manipulated when nodes are deleted.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
    verbs:
      - list
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  # Needs access to update clusterinformations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - clusterinformations
    verbs:
      - get
      - create
      - update
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-kube-controllers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-kube-controllers
subjects:
- kind: ServiceAccount
  name: calico-kube-controllers
  namespace: kube-system
---
# Include a clusterrole for the calico-node DaemonSet,
# and bind it to the calico-node serviceaccount.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # The CNI plugin needs to get pods, nodes, and namespaces.
  - apiGroups: [""]
    resources:
      - pods
      - nodes
      - namespaces
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - endpoints
      - services
    verbs:
      # Used to discover service IPs for advertisement.
      - watch
      - list
      # Used to discover Typhas.
      - get
  - apiGroups: [""]
    resources:
      - nodes/status
    verbs:
      # Needed for clearing NodeNetworkUnavailable flag.
      - patch
      # Calico stores some configuration information in node annotations.
      - update
  # Watch for changes to Kubernetes NetworkPolicies.
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
    verbs:
      - watch
      - list
  # Used by Calico for policy information.
  - apiGroups: [""]
    resources:
      - pods
      - namespaces
      - serviceaccounts
    verbs:
      - list
      - watch
  # The CNI plugin patches pods/status.
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - patch
  # Calico monitors various CRDs for config.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - felixconfigurations
      - bgppeers
      - globalbgpconfigs
      - bgpconfigurations
      - ippools
      - ipamblocks
      - globalnetworkpolicies
      - globalnetworksets
      - networkpolicies
      - networksets
      - clusterinformations
      - hostendpoints
      - blockaffinities
    verbs:
      - get
      - list
      - watch
  # Calico must create and update some CRDs on startup.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
      - felixconfigurations
      - clusterinformations
    verbs:
      - create
      - update
  # Calico stores some configuration information on the node.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  # These permissions are only required for upgrade from v2.6, and can
  # be removed after upgrade or on fresh installations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - bgpconfigurations
      - bgppeers
    verbs:
      - create
      - update
  # These permissions are required for Calico CNI to perform IPAM allocations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ipamconfigs
    verbs:
      - get
  # Block affinities must also be watchable by confd for route aggregation.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
    verbs:
      - watch
  # The Calico IPAM migration needs to get daemonsets. These permissions can be
  # removed if not upgrading from an installation using host-local IPAM.
  - apiGroups: ["apps"]
    resources:
      - daemonsets
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-node
subjects:
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system

---
# Source: calico/templates/calico-node.yaml
# This manifest installs the calico-node container, as well
# as the CNI plugins and network config on
# each master and worker node in a Kubernetes cluster.
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: calico-node
  namespace: kube-system
  labels:
    k8s-app: calico-node
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        k8s-app: calico-node
      annotations:
        # This, along with the CriticalAddonsOnly toleration below,
        # marks the pod as a critical add-on, ensuring it gets
        # priority scheduling and that its resources are reserved
        # if it ever gets evicted.
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      hostNetwork: true
      tolerations:
        # Make sure calico-node gets scheduled on all nodes.
        - effect: NoSchedule
          operator: Exists
        # Mark the pod as a critical add-on for rescheduling.
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoExecute
          operator: Exists
      serviceAccountName: calico-node
      # Minimize downtime during a rolling upgrade or deletion; tell Kubernetes to do a "force
      # deletion": https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.
      terminationGracePeriodSeconds: 0
      priorityClassName: system-node-critical
      initContainers:
        # This container performs upgrade from host-local IPAM to calico-ipam.
        # It can be deleted if this is a fresh installation, or if you have already
        # upgraded to use calico-ipam.
        - name: upgrade-ipam
          image: calico/cni:v3.11.3
          command: ["/opt/cni/bin/calico-ipam", "-upgrade"]
          env:
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
          volumeMounts:
            - mountPath: /var/lib/cni/networks
              name: host-local-net-dir
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
          securityContext:
            privileged: true
        # This container installs the CNI binaries
        # and CNI network config file on each node.
        - name: install-cni
          image: calico/cni:v3.11.3
          command: ["/install-cni.sh"]
          env:
            # Name of the CNI config file to create.
            - name: CNI_CONF_NAME
              value: "10-calico.conflist"
            # The CNI network config to install on each node.
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: cni_network_config
            # Set the hostname based on the k8s node name.
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # CNI MTU Config variable
            - name: CNI_MTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            # Prevents the container from sleeping forever.
            - name: SLEEP
              value: "false"
          volumeMounts:
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
          securityContext:
            privileged: true
        # Adds a Flex Volume Driver that creates a per-pod Unix Domain Socket to allow Dikastes
        # to communicate with Felix over the Policy Sync API.
        - name: flexvol-driver
          image: calico/pod2daemon-flexvol:v3.11.3
          volumeMounts:
          - name: flexvol-driver-host
            mountPath: /host/driver
          securityContext:
            privileged: true
      containers:
        # Runs calico-node container on each Kubernetes node.  This
        # container programs network policy and routes on each
        # host.
        - name: calico-node
          image: calico/node:v3.11.3
          env:
            # Use Kubernetes API as the backing datastore.
            - name: DATASTORE_TYPE
              value: "kubernetes"
            # Wait for the datastore.
            - name: WAIT_FOR_DATASTORE
              value: "true"
            # Set based on the k8s node name.
            - name: NODENAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # Choose the backend to use.
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
            # Cluster type to identify the deployment type
            - name: CLUSTER_TYPE
              value: "k8s,bgp"
            # Auto-detect the BGP IP address.
            - name: IP
              value: "autodetect"
            # Enable IPIP
            - name: CALICO_IPV4POOL_IPIP
              value: "Always"
            # Set MTU for tunnel device used if ipip is enabled
            - name: FELIX_IPINIPMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            # The default IPv4 pool to create on startup if none exists. Pod IPs will be
            # chosen from this range. Changing this value after installation will have
            # no effect. This should fall within `--cluster-cidr`.
            - name: CALICO_IPV4POOL_CIDR
              value: "10.244.0.0/16"
            # Disable file logging so `kubectl logs` works.
            - name: CALICO_DISABLE_FILE_LOGGING
              value: "true"
            # Set Felix endpoint to host default action to ACCEPT.
            - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
              value: "ACCEPT"
            # Disable IPv6 on Kubernetes.
            - name: FELIX_IPV6SUPPORT
              value: "false"
            # Set Felix logging to "info"
            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"
            - name: FELIX_HEALTHENABLED
              value: "true"
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: 250m
          livenessProbe:
            exec:
              command:
              - /bin/calico-node
              - -felix-live
              - -bird-live
            periodSeconds: 10
            initialDelaySeconds: 10
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
              - /bin/calico-node
              - -felix-ready
              - -bird-ready
            periodSeconds: 10
          volumeMounts:
            - mountPath: /lib/modules
              name: lib-modules
              readOnly: true
            - mountPath: /run/xtables.lock
              name: xtables-lock
              readOnly: false
            - mountPath: /var/run/calico
              name: var-run-calico
              readOnly: false
            - mountPath: /var/lib/calico
              name: var-lib-calico
              readOnly: false
            - name: policysync
              mountPath: /var/run/nodeagent
      volumes:
        # Used by calico-node.
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: var-run-calico
          hostPath:
            path: /var/run/calico
        - name: var-lib-calico
          hostPath:
            path: /var/lib/calico
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
        # Used to install CNI.
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
        # Mount in the directory for host-local IPAM allocations. This is
        # used when upgrading from host-local to calico-ipam, and can be removed
        # if not using the upgrade-ipam init container.
        - name: host-local-net-dir
          hostPath:
            path: /var/lib/cni/networks
        # Used to create per-pod Unix Domain Sockets
        - name: policysync
          hostPath:
            type: DirectoryOrCreate
            path: /var/run/nodeagent
        # Used to install Flex Volume Driver
        - name: flexvol-driver-host
          hostPath:
            type: DirectoryOrCreate
            path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-node
  namespace: kube-system

---
# Source: calico/templates/calico-kube-controllers.yaml

# See https://github.com/projectcalico/kube-controllers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: calico-kube-controllers
  namespace: kube-system
  labels:
    k8s-app: calico-kube-controllers
spec:
  # The controllers can only have a single active instance.
  replicas: 1
  selector:
    matchLabels:
      k8s-app: calico-kube-controllers
  strategy:
    type: Recreate
  template:
    metadata:
      name: calico-kube-controllers
      namespace: kube-system
      labels:
        k8s-app: calico-kube-controllers
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      tolerations:
        # Mark the pod as a critical add-on for rescheduling.
        - key: CriticalAddonsOnly
          operator: Exists
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      serviceAccountName: calico-kube-controllers
      priorityClassName: system-cluster-critical
      containers:
        - name: calico-kube-controllers
          image: calico/kube-controllers:v3.11.3
          env:
            # Choose which controllers to run.
            - name: ENABLED_CONTROLLERS
              value: node
            - name: DATASTORE_TYPE
              value: kubernetes
          readinessProbe:
            exec:
              command:
              - /usr/bin/check-status
              - -r
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-kube-controllers
  namespace: kube-system
---
# Source: calico/templates/calico-etcd-secrets.yaml
---
# Source: calico/templates/calico-typha.yaml
---
# Source: calico/templates/configure-canal.yaml
EOF


(base) [root@kubuflow ~]# kubectl apply -f calico.yaml 
configmap/calico-config created
Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created

3.15 Verify the network

(base) [root@kubuflow ~]# kubectl get nodes
NAME       STATUS   ROLES                  AGE   VERSION
kubuflow   Ready    control-plane,master   13m   v1.21.5
(base) [root@kubuflow ~]#  kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5bcd7db644-ncdh5   1/1     Running   0          114s
calico-node-9qjv8                          1/1     Running   0          114s
coredns-59d64cd4d4-574b4                   1/1     Running   0          13m
coredns-59d64cd4d4-5mr9x                   1/1     Running   0          13m
etcd-kubuflow                              1/1     Running   0          13m
kube-apiserver-kubuflow                    1/1     Running   0          13m
kube-controller-manager-kubuflow           1/1     Running   0          13m
kube-proxy-xcfcd                           1/1     Running   0          13m
kube-scheduler-kubuflow                    1/1     Running   0          13m
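The same check can be scripted instead of read by eye. The following is a minimal sketch built on kubectl wait; the function name wait_for_cluster and the 300-second default timeout are illustrative assumptions, not part of the original steps.

```shell
# Hypothetical helper: block until every node and every kube-system pod is Ready.
# The function name and the default timeout are assumptions for illustration.
wait_for_cluster() {
  local timeout="${1:-300s}"
  # kubectl wait exits non-zero if the condition is not met within the timeout.
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout" &&
    kubectl -n kube-system wait --for=condition=Ready pods --all --timeout="$timeout"
}
```

For example, wait_for_cluster 120s && echo "network is up" prints the message only once Calico and CoreDNS report Ready.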

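3.16 Remove the taint

After a single-node Kubernetes installation, services cannot be deployed.
By default the master node carries a taint that keeps pods from being scheduled on it, so you must either remove the taint or add a worker node. Here we remove the taint.

# If the command prints output, the node is tainted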

(base) [root@kubuflow ~]# kubectl get node -o yaml | grep taint -A 5
    taints:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
  status:
    addresses:
    - address: 192.168.3.130

Remove the taint:

(base) [root@kubuflow ~]# kubectl taint nodes --all node-role.kubernetes.io/master-
node/kubuflow untainted
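To confirm the result without grepping the full node YAML, a small sketch like the one below can help. The helper names are hypothetical; the jsonpath expression only reads the taint keys of each node. On newer Kubernetes releases the control-plane taint key is node-role.kubernetes.io/control-plane rather than node-role.kubernetes.io/master, so the pattern would need adjusting there.

```shell
# Hypothetical helper: print "<node><TAB><taint keys>" for every node.
list_node_taints() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
}

# Succeeds (exit status 0) when no node still carries the master taint.
master_taint_removed() {
  ! list_node_taints | grep -q 'node-role.kubernetes.io/master'
}
```

Running master_taint_removed && echo "pods can be scheduled on this node" after the kubectl taint command gives a quick yes/no answer.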

3.17 Install the command-completion package

(base) [root@kubuflow ~]# yum -y install bash-completion  # install the command-completion package
(base) [root@kubuflow ~]# kubectl completion bash
(base) [root@kubuflow ~]# source /usr/share/bash-completion/bash_completion
(base) [root@kubuflow ~]# kubectl completion bash >/etc/profile.d/kubectl.sh
(base) [root@kubuflow ~]# source /etc/profile.d/kubectl.sh
(base) [root@kubuflow ~]# cat  >>  /root/.bashrc <<EOF
source /etc/profile.d/kubectl.sh
EOF

3.18 Deploy and access the Kubernetes Dashboard

The Dashboard is not deployed by default. Deploy it with the following command:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.6.1/aio/deploy/recommended.yaml

Check that it is running:

(base) [root@kubuflow ~]# kubectl get pod -n kubernetes-dashboard
NAME                                         READY   STATUS    RESTARTS   AGE
dashboard-metrics-scraper-7c857855d9-snpfs   1/1     Running   0          16m
kubernetes-dashboard-6b79449649-4kgsx        1/1     Running   0          16m

Change the Service type from ClusterIP to NodePort so that the Service can be reached from outside the cluster at <NodeIP>:<NodePort>:

(base) [root@kubuflow ~]# kubectl edit svc kubernetes-dashboard -n kubernetes-dashboard


Change type: ClusterIP to type: NodePort, save, then run kubectl get svc -n kubernetes-dashboard to see the automatically assigned port:


(base) [root@kubuflow ~]# kubectl get svc -n kubernetes-dashboard
NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
dashboard-metrics-scraper   ClusterIP   10.98.238.142    <none>        8000/TCP        25m
kubernetes-dashboard        NodePort    10.105.207.158   <none>        443:30988/TCP   25m

As shown above, the Dashboard is now exposed on port 30988 and can be reached from outside the cluster at https://<NodeIP>:30988/.
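If you prefer not to open an editor, the Service type can also be switched with kubectl patch, and the assigned port read back with a jsonpath query. This is a sketch only; the helper names expose_dashboard and dashboard_url are made up for illustration, and the node IP is whatever address your node is reachable on.

```shell
# Hypothetical helpers; the names are illustrative.
expose_dashboard() {
  # Switch the Service type without opening an editor.
  kubectl -n kubernetes-dashboard patch svc kubernetes-dashboard \
    -p '{"spec":{"type":"NodePort"}}'
}

dashboard_url() {
  local node_ip="$1" port
  # Read the NodePort that Kubernetes assigned to the first Service port.
  port="$(kubectl -n kubernetes-dashboard get svc kubernetes-dashboard \
    -o jsonpath='{.spec.ports[0].nodePort}')" || return 1
  echo "https://${node_ip}:${port}/"
}
```

For example, expose_dashboard && dashboard_url 192.168.3.130 prints the URL to open in a browser.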

Create an access account

cat >  dash.yaml << EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard
EOF

(base) [root@kubuflow ~]#  kubectl apply -f dash.yaml
serviceaccount/admin-user created
clusterrolebinding.rbac.authorization.k8s.io/admin-user created

View the token

kubectl -n kubernetes-dashboard get secret $(kubectl -n kubernetes-dashboard get sa/admin-user -o jsonpath="{.secrets[0].name}") -o go-template="{{.data.token | base64decode}}"

eyJhbGciOiJSUzI1Nxxx.....xxxxxxxxx..........pTDfnNmg
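The command above relies on the ServiceAccount having an auto-created token Secret, which is the case on the v1.21.5 cluster used here. Kubernetes 1.24 and later no longer create that Secret automatically, and kubectl create token is used there instead. The sketch below tries the Secret first and falls back to create token; the function name dashboard_token is an assumption for illustration.

```shell
# Hypothetical helper: print a login token for the admin-user ServiceAccount.
dashboard_token() {
  local ns=kubernetes-dashboard secret
  # Empty on clusters that do not auto-create a token Secret (Kubernetes 1.24+).
  secret="$(kubectl -n "$ns" get sa/admin-user -o jsonpath='{.secrets[0].name}')"
  if [ -n "$secret" ]; then
    kubectl -n "$ns" get secret "$secret" -o go-template='{{.data.token | base64decode}}'
  else
    # Requires kubectl and a cluster at version 1.24 or newer.
    kubectl -n "$ns" create token admin-user
  fi
}
```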

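Because my host is reached through remote port mapping, the address shown here looks different from the host IP.
The actual address should be https://192.168.3.130:30988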

4 Installing Kubeflow

4.1 Download the official manifests repository

Install version 1.6.0

(base) [root@kubuflow softwares]# wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.6.0.zip

(base) [root@kubuflow softwares]# unzip v1.6.0.zip
(base) [root@kubuflow softwares]# mv manifests-1.6.0/ manifests

4.2 Download and install kustomize

https://github.com/kubernetes-sigs/kustomize

curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash

If the download is slow, you can use a proxy to speed up access to GitHub:

(base) [root@kubuflow softwares]# curl -s "https://ghproxy.com/https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash

Copy the binary to /bin

 cp kustomize /bin/
 kustomize version

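4.3 Syncing the images to Docker Hub

Some Kubeflow component images are hosted on registries outside China, so the problem of pulling images from Google's registry has to be solved first. For details, see this post:

Latest way to install Kubeflow in a mainland-China environment: https://zhuanlan.zhihu.com/p/546677250

### List the gcr images. On my network only gcr.io is unreachable and quay.io works fine; adjust this to your needs
kustomize build example | grep 'image: gcr.io' | awk '$2 != "" { print $2 }' | sort -u
### Use GitHub CI to sync them to a personal Docker Hub repository
https://github.com/kenwoodjw/sync_gcr
Edit https://github.com/kenwoodjw/sync_gcr/blob/master/images.txt; committing the change triggers CI to sync the images to Docker Hub
Adjust https://github.com/kenwoodjw/sync_gcr/blob/master/sync_image.py as needed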

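4.4 Prepare the StorageClass, PVs, and PVCs

Kubeflow components need persistent storage, so the PVs must be prepared in advance. This walkthrough uses local-disk storage. The procedure is as follows.
Be careful here: the names and paths must be exact. Follow the steps below, or adjust them carefully to match the paths you created.

  1. Prepare the local directories
mkdir -p /data/k8s/istio-authservice /data/k8s/katib-mysql /data/k8s/minio /data/k8s/mysql-pv-claim

Change the permissions on the auth path

sudo chmod -R 777 /data/k8s/istio-authservice/
  2. Write kubeflow-storage.yaml
    Set each hostPath path (for example "/data/k8s/istio-authservice") to the corresponding directory created above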
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: authservice
  namespace: istio-system
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/istio-authservice"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  namespace: kubeflow
  name: katib-mysql
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/katib-mysql"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/minio"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-claim
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/mysql-pv-claim"

Apply it:

kubectl apply -f kubeflow-storage.yaml
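The four PersistentVolume definitions differ only in name, size, and host directory, so they can also be generated rather than written by hand. The sketch below is one way to do that; the function names pv_manifest and kubeflow_pvs are hypothetical. It leaves out the namespace field because PersistentVolumes are cluster-scoped, and it does not emit the StorageClass, which still has to be created as shown earlier.

```shell
# Hypothetical generator: print one hostPath PersistentVolume manifest.
# Arguments: <name> <size> <host directory>
pv_manifest() {
  cat <<EOF
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $1
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: $2
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "$3"
EOF
}

# Emit the four volumes used in this section.
kubeflow_pvs() {
  pv_manifest authservice    10Gi /data/k8s/istio-authservice
  pv_manifest katib-mysql    10Gi /data/k8s/katib-mysql
  pv_manifest minio          20Gi /data/k8s/minio
  pv_manifest mysql-pv-claim 20Gi /data/k8s/mysql-pv-claim
}
```

The output can be piped straight into the cluster with kubeflow_pvs | kubectl apply -f -, and kubectl get pv should then list four volumes.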

4.5 Modify the install manifests to pull the mirrored images

(base) [root@kubuflow example]# cat kustomization.yaml

Change manifests/example/kustomization.yaml as follows; that is, append the images section at the end. This amounts to using the Docker Hub copies of the Google-hosted (gcr.io, quay.io) images:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
# Cert-Manager
- ../common/cert-manager/cert-manager/base
- ../common/cert-manager/kubeflow-issuer/base
# Istio
- ../common/istio-1-16/istio-crds/base
- ../common/istio-1-16/istio-namespace/base
- ../common/istio-1-16/istio-install/base
# OIDC Authservice
- ../common/oidc-authservice/base
# Dex
- ../common/dex/overlays/istio
# KNative
- ../common/knative/knative-serving/overlays/gateways
- ../common/knative/knative-eventing/base
- ../common/istio-1-16/cluster-local-gateway/base
# Kubeflow namespace
- ../common/kubeflow-namespace/base
# Kubeflow Roles
- ../common/kubeflow-roles/base
# Kubeflow Istio Resources
- ../common/istio-1-16/kubeflow-istio-resources/base


# Kubeflow Pipelines
- ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
# Katib
- ../apps/katib/upstream/installs/katib-with-kubeflow
# Central Dashboard
- ../apps/centraldashboard/upstream/overlays/kserve
# Admission Webhook
- ../apps/admission-webhook/upstream/overlays/cert-manager
# Jupyter Web App
- ../apps/jupyter/jupyter-web-app/upstream/overlays/istio
# Notebook Controller
- ../apps/jupyter/notebook-controller/upstream/overlays/kubeflow
# Profiles + KFAM
# - ../apps/profiles/upstream/overlays/kubeflow
# Volumes Web App
- ../apps/volumes-web-app/upstream/overlays/istio
# Tensorboards Controller
-  ../apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow
# Tensorboard Web App
-  ../apps/tensorboard/tensorboards-web-app/upstream/overlays/istio
# Training Operator
- ../apps/training-operator/upstream/overlays/kubeflow
# User namespace
- ../common/user-namespace/base

# KServe
- ../contrib/kserve/kserve
- ../contrib/kserve/models-web-app/overlays/kubeflow


images:
 - name: gcr.io/arrikto/istio/pilot:1.14.1-1-g19df463bb
   newName: kenwood/pilot
   newTag: "1.14.1-1-g19df463bb"
 - name: gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
   newName: kenwood/oidc-authservice
   newTag: "28c59ef"
 - name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:dc0ac2d8f235edb04ec1290721f389d2bc719ab8b6222ee86f17af8d7d2a160f
   newName: kenwood/controller
   newTag: "dc0ac2"
 - name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:632d9d710d070efed2563f6125a87993e825e8e36562ec3da0366e2a897406c0
   newName: kenwood/cmd/mtping
   newTag: "632d9d"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:847bb97e38440c71cb4bcc3e430743e18b328ad1e168b6fca35b10353b9a2c22
   newName: kenwood/domain-mapping-webhook
   newTag: "847bb9"
 - name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:b7faf7d253bd256dbe08f1cac084469128989cf39abbe256ecb4e1d4eb085a31
   newName: kenwood/webhook
   newTag: "b7faf7"
 - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:f253b82941c2220181cee80d7488fe1cefce9d49ab30bdb54bcb8c76515f7a26
   newName: kenwood/controller
   newTag: "f253b8"
 - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:a705c1ea8e9e556f860314fe055082fbe3cde6a924c29291955f98d979f8185e
   newName: kenwood/webhook
   newTag: "a705c1"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:93ff6e69357785ff97806945b284cbd1d37e50402b876a320645be8877c0d7b7
   newName: kenwood/activator
   newTag: "93ff6e"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:007820fdb75b60e6fd5a25e65fd6ad9744082a6bf195d72795561c91b425d016
   newName: kenwood/autoscaler
   newTag: "007820"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:75cfdcfa050af9522e798e820ba5483b9093de1ce520207a3fedf112d73a4686
   newName: kenwood/controller
   newTag: "75cfdc"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:847bb97e38440c71cb4bcc3e430743e18b328ad1e168b6fca35b10353b9a2c22
   newName: kenwood/domain-mapping-webhook
   newTag: "847bb9"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:23baa19322320f25a462568eded1276601ef67194883db9211e1ea24f21a0beb
   newName: kenwood/domain-mapping
   newTag: "23baa1"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:14415b204ea8d0567235143a6c3377f49cbd35f18dc84dfa4baa7695c2a9b53d
   newName: kenwood/queue
   newTag: "14415b"
 - name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:9084ea8498eae3c6c4364a397d66516a25e48488f4a9871ef765fa554ba483f0
   newName: kenwood/webhook
   newTag: "9084ea"
 - name: gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.3
   newName: kenwood/visualization-server
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/ml-pipeline/cache-server:2.0.0-alpha.3
   newName: kenwood/cache-server
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.3
   newName: kenwood/metadata-envoy
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.3
   newName: kenwood/viewer-crd-controller
   newTag: "2.0.0-alpha.3"
 - name: gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
   newName: kenwood/oidc-authservice
   newTag: "28c59ef"
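Writing these images entries by hand is error-prone, so here is a rough sketch that generates them from a list of image references. The naming rule (last path component as the new name; the tag, or the first six hex digits of the digest, as the new tag) is only inferred from most of the entries above and is not the documented behaviour of sync_gcr. The function name to_kustomize_images and its Docker Hub user argument are assumptions, so check the output against what your sync job actually pushed.

```shell
# Hypothetical helper: read image references on stdin and print a kustomize
# images entry for each. Argument: <Docker Hub user>.
# The naming rule is an assumption inferred from the entries in this section.
to_kustomize_images() {
  local user="${1:?Docker Hub user required}" ref path tag
  while IFS= read -r ref; do
    [ -n "$ref" ] || continue
    case "$ref" in
      *@sha256:*)
        path="${ref%%@*}"
        tag="$(printf '%.6s' "${ref##*@sha256:}")"   # first 6 hex digits of the digest
        ;;
      *:*)
        path="${ref%:*}"
        tag="${ref##*:}"
        ;;
      *)
        path="$ref"
        tag="latest"
        ;;
    esac
    printf ' - name: %s\n   newName: %s/%s\n   newTag: "%s"\n' \
      "$ref" "$user" "${path##*/}" "$tag"
  done
}
```

It can be fed from the listing command in section 4.3, for example: kustomize build example | grep 'image: gcr.io' | awk '$2 != "" { print $2 }' | sort -u | to_kustomize_images kenwood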

Edit the following YAML files, adding storageClassName: local-storage to each one:

apps/katib/upstream/components/mysql/pvc.yaml
apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml
apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml
common/oidc-authservice/base/pvc.yaml
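The edit can be scripted as well. The sketch below assumes each PVC file contains a single top-level spec: key and uses two-space indentation; the function name add_storage_class is hypothetical. It prints the modified manifest to standard output and leaves files that already set a storage class unchanged.

```shell
# Hypothetical helper: print <file> with storageClassName added under spec:.
# Arguments: <file> [storage class name, default local-storage]
add_storage_class() {
  # Leave the manifest alone if a storage class is already set.
  if grep -q 'storageClassName:' "$1"; then
    cat "$1"
    return
  fi
  awk -v sc="${2:-local-storage}" '
    { print }
    /^spec:[[:space:]]*$/ && !done { print "  storageClassName: " sc; done = 1 }
  ' "$1"
}
```

For each file f in the list, add_storage_class "$f" > "$f.new" && mv "$f.new" "$f" rewrites it in place.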


4.6 One-command installation

https://github.com/kubeflow/manifests#install-with-a-single-command

(base) [root@kubuflow manifests]# pwd
/root/softwares/manifests
(base) [root@kubuflow manifests]# while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
2022/12/24 16:23:51 well-defined vars that were never replaced: kfp-app-name,kfp-app-version

After most of the pods have been created, the result is as follows:

The last error is: resource mapping not found for name: "kubeflow-user-example-com" namespace: "" from "STDIN": no matches for kind "Profile" in version "kubeflow.org/v1beta1". It can be ignored for now; the resource appears to be the example user namespace from the official Kubeflow manifests. Note that the Profiles + KFAM entry is commented out in the kustomization.yaml above, which is consistent with the cluster not recognizing the Profile kind. For details, see the step-by-step installation instructions:
https://github.com/kubeflow/manifests#user-namespace
kustomize build common/user-namespace/base | kubectl apply -f -

After a while (be patient: every pod has to pull its image and create its container along the way, so this is relatively slow), check the status of the pods. Once all of them are Running, the installation has gone through cleanly and you can visit the Kubeflow dashboard.

(base) [root@kubuflow ~]# kubectl get pods --all-namespaces 
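With this many pods, scanning the listing by eye gets tedious. The filter below is a small sketch that reads the kubectl get pods --all-namespaces output and prints only the pods that still need attention; the function name pending_pods is an assumption. It locates columns by their header names, so it depends on the default table output.

```shell
# Hypothetical filter: read "kubectl get pods --all-namespaces" output on stdin
# and print "<namespace>/<name> <status> <ready>" for every pod that is not
# fully ready. Completed pods are skipped.
pending_pods() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      status = $(col["STATUS"])
      split($(col["READY"]), r, "/")
      if (status == "Completed") next
      if (status != "Running" || r[1] != r[2])
        print $(col["NAMESPACE"]) "/" $(col["NAME"]), status, $(col["READY"])
    }
  '
}
```

Run it as kubectl get pods --all-namespaces | pending_pods; empty output means every pod is Running and ready.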


The Kubernetes Dashboard also shows that all pods are running normally.

4.7 Accessing the Kubeflow dashboard

kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80

--address 0.0.0.0 means the forwarded port can be reached from external hosts; without it, only local access is possible.

Default user and password:

user@example.com
12341234

Only HTTP access works; HTTPS has a problem.

5 References


Origin blog.csdn.net/yanqianglifei/article/details/128432784