(Author: Chen Yujue)
Cube Studio currently includes traditional machine-learning templates and 400+ AI models; feel free to message us privately to learn more!
When using cube studio for model training or inference, we sometimes find that no existing template meets our requirements. In that case we need to create a template and build a pipeline ourselves, so that it can be reused and scheduled directly in similar modeling or monitoring scenarios later.
The following walks through building a pipeline for random forest modeling, as a record of how to build a pipeline with cube studio.
1. Code construction
What makes this different from ordinary modeling code is that the code here must accept input parameters, so that it can be wired into our template: the template passes its settings to the code through these parameters. The modeling code itself is omitted, since it is standard; only the input parameters that differ from ordinary modeling are shown.
```python
import argparse

if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser("train random forest launcher")
    arg_parser.add_argument('--train_dataset', type=str, help="training dataset source", default='')
    arg_parser.add_argument('--val_dataset', type=str, help="validation dataset name", default='')
    arg_parser.add_argument('--feature_columns', type=str, help="feature columns", default='')
    arg_parser.add_argument('--label_columns', type=str, help="label columns", default='')
    arg_parser.add_argument('--save_model_dir', type=str, help="model save path", default='')
    arg_parser.add_argument('--save_val_dir', type=str, help="output path for train/test evaluation results", default='')
    arg_parser.add_argument('--inference_dataset', type=str, help="inference dataset name", default='')
    arg_parser.add_argument('--result_path', type=str, help="path to save inference results", default='')
```
The key is these input parameters: when designing the template, think through from the start which parameters need to be passed into the template and which results it needs to output.
Once the script runs successfully in a notebook, you can proceed to the next step.
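As a reference, a minimal sketch of the training logic that could sit behind these parameters is shown below. It is not the project's actual launcher.py: the CSV input format, the model settings, and the output file layout are all illustrative assumptions.

```python
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train(train_dataset, feature_columns, label_columns, save_model_dir, save_val_dir):
    # --feature_columns / --label_columns arrive as comma-separated strings
    features = feature_columns.split(',')
    label = label_columns.split(',')[0]

    df = pd.read_csv(train_dataset)
    x_train, x_test, y_train, y_test = train_test_split(
        df[features], df[label], test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(x_train, y_train)

    # Save the fitted model where a downstream inference task can load it
    os.makedirs(save_model_dir, exist_ok=True)
    model_path = os.path.join(save_model_dir, 'model.pkl')
    joblib.dump(model, model_path)

    # Write train/test accuracy to the txt file named by --save_val_dir
    with open(save_val_dir, 'w') as f:
        f.write(f"train_acc={model.score(x_train, y_train):.4f}\n")
        f.write(f"test_acc={model.score(x_test, y_test):.4f}\n")
    return model_path
```

This would be called with the parsed arguments, e.g. `train(args.train_dataset, args.feature_columns, args.label_columns, args.save_model_dir, args.save_val_dir)`.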
2. Image build
1. git clone the cube studio project;
2. Under the cube-studio/job-template/job/ folder, create a new random_forest folder, put the py file in it, and name it launcher.py;
3. In the same folder, create three more files: build.sh, Dockerfile, and README.md.
build.sh builds and pushes the image:
```bash
#!/bin/bash
set -ex
docker build -t ccr.ccs.tencentyun.com/cube-studio/randomforest:20230428 -f job/random_forest/Dockerfile .
docker push ccr.ccs.tencentyun.com/cube-studio/randomforest:20230428
```
The Dockerfile also varies from person to person. The main things to adjust are which packages to install and which folder launcher.py is copied from. (Note that PyInstaller 6.0 removed the --key option used below, so an older PyInstaller release may be needed for this exact Dockerfile.)
```dockerfile
FROM python:3.9
ENV TZ Asia/Shanghai
ENV DEBIAN_FRONTEND noninteractive
RUN /usr/local/bin/python -m pip install --upgrade pip
# argparse and joblib's dependencies are part of the standard stack; add what your code needs
RUN pip install pysnooper psutil requests numpy scikit-learn pandas pyinstaller tinyaes joblib
#-i https://pypi.tuna.tsinghua.edu.cn/simple/
#http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
COPY job/random_forest/launcher.py /app/
WORKDIR /app
ENV PYTHONPATH=/app:$PYTHONPATH
RUN pyinstaller --onefile --key=kaiqiao launcher.py && cp dist/launcher ./ && rm -rf launcher.py launcher.spec build dist
ENTRYPOINT ["./launcher"]
```
README.md defines the input parameters and controls how the template is displayed on the front end:
# randomforest template
Image: ccr.ccs.tencentyun.com/cube-studio/randomforest:20230428
Parameters:
```json
{
    "Training parameters": {
        "--train_dataset": {
            "type": "str",
            "item_type": "str",
            "label": "training dataset",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "training dataset",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--save_val_dir": {
            "type": "str",
            "item_type": "str",
            "label": "output file path for train/test accuracy",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "output file path for train/test accuracy, a txt file",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--label_columns": {
            "type": "str",
            "item_type": "str",
            "label": "label columns",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "label columns, comma-separated",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--save_model_dir": {
            "type": "str",
            "item_type": "str",
            "label": "model save directory",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "model save directory",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--feature_columns": {
            "type": "str",
            "item_type": "str",
            "label": "feature columns, comma-separated",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "feature columns, comma-separated",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        }
    },
    "Inference parameters": {
        "--result_path": {
            "type": "str",
            "item_type": "str",
            "label": "path to save inference results",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "path to save inference results",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--inference_dataset": {
            "type": "str",
            "item_type": "str",
            "label": "inference dataset",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "inference dataset",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--save_model_dir": {
            "type": "str",
            "item_type": "str",
            "label": "model save directory",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "model save directory",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--label_columns": {
            "type": "str",
            "item_type": "str",
            "label": "label columns",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "label columns, comma-separated",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--feature_columns": {
            "type": "str",
            "item_type": "str",
            "label": "feature columns, comma-separated",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "feature columns, comma-separated",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        }
    },
    "Validation parameters": {
        "--val_dataset": {
            "type": "str",
            "item_type": "str",
            "label": "validation dataset",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "validation dataset",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--save_model_dir": {
            "type": "str",
            "item_type": "str",
            "label": "model save directory",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "model save directory",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--label_columns": {
            "type": "str",
            "item_type": "str",
            "label": "label columns",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "label columns, comma-separated",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--feature_columns": {
            "type": "str",
            "item_type": "str",
            "label": "feature columns, comma-separated",
            "require": 0,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "",
            "describe": "feature columns, comma-separated",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        }
    }
}
```
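A malformed parameter block will break the template's front-end rendering, so before registering the template it can help to check that the JSON parses and that each entry carries the fields the UI expects. The small checker below is an illustrative assumption, not part of cube studio:

```python
import json


def check_template_args(path):
    """Validate a template parameter block saved as a standalone JSON file.

    Checks that the file parses, that every entry key looks like a CLI flag,
    and that each entry has at least a type and a label.
    """
    with open(path, encoding='utf-8') as f:
        args = json.load(f)
    for group, params in args.items():
        for flag, spec in params.items():
            assert flag.startswith('--'), f"{flag} is not a CLI flag"
            assert 'type' in spec and 'label' in spec, f"{flag} is missing fields"
    return args
```

Running it on the block above (saved to a file) would catch stray commas or missing fields before they reach the template editor.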
4. Install Docker and run the following:
```bash
cd /data/k8s/kubeflow/pipeline/workspace/admin/cube-studio/job-template
sh job/random_forest/build.sh
```
3. Image Management
Register the image just built in image management; the image name is the one used in build.sh.
4. Template addition
Add a task template under model training.
This mainly involves selecting the image and setting the template's startup parameters, i.e., the set of parameters defined in README.md.
5. Create a pipeline
Go to the homepage and select New Pipeline.
Drag templates from the left-hand panel into place as required, click each task node to set its run parameters, and run the pipeline.
If a task fails, click the node to view its log and troubleshoot the problem.
That's it!