Machine learning made easy: a detailed guide to training models on a server

This article has participated in the "New Talent Creation Ceremony" event

Software download and installation

Required software (this is what I use; alternatives work too)

1. Xshell: connects to the server and runs commands (other tools, such as a plain ssh client, work too). Download and installation: free copies can be found online (supporting the licensed version is recommended); after downloading, follow the installer's prompts.

2. Xmanager: only needed for graphical output, so skip it if you do not use that. As with Xshell, free copies can be found online (supporting the licensed version is recommended); after downloading, follow the installer's prompts.

3. PyCharm: available in a Community edition and a Professional edition. The Community edition is free and covers the general features. The Professional edition has a one-month trial, after which a registration code is required. Here PyCharm is only used to upload code and download trained weights, so the Community edition is enough.

Configure Xshell

Session manager

In the session manager, right-click All Sessions and choose New -> Session.

Note: the "New Session 1" entry shown here is a session the author had already created; it will not be there on first use.

New session

Enter a session name (optional), the host, and the port number.

Xmanager settings

In the session's SSH tunneling options, set X11 forwarding to Xmanager.

Connect to the server with Xshell

If the server sits on an intranet, such as a school network, log in to the intranet first.

Enter your user name

Enter the password

Connection succeeded

Configure PyCharm

Both the Community and Professional editions can be configured, although the menu location may differ; the Professional edition is used here. If Tools shows no Deployment entry in your edition, the feature may be restricted to the Professional edition in that release.

Menu

First, open Tools in the menu bar.

Configuration

Select Deployment -> Configuration.

SFTP

Click the "+" sign -> SFTP.

Account and password

Click '...' next to SSH configuration.

SSH configuration

Enter the server address in Host, the user name in User Name, and the user's password in Password. If the server does not use the default port 22, change the port number as well.

Test

Select Save password and click Test Connection. If the prompt shown in the figure appears, the connection succeeded.

View the remote host

Choose Tools -> Deployment -> Browse Remote Host to open the remote host.

Remote host

Show the remote host

Select SFTP

Click the down arrow and select the SFTP configuration you just created.

Result

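In this directory you can see your own user's folder; it is best to upload code inside your own folder.

Uploading and downloading files only takes selecting and dragging. To upload code to the server, select the code on the left side of PyCharm and drag it into the target folder on the right. To download trained weights, select the weights file on the server on the right and drag it into a folder on the left.

Install dependencies

A freshly configured server environment usually lacks the packages the code needs, so install the dependencies before training.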

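pip timeouts: some servers are not normally connected to the Internet, so commands on them effectively run offline and pip times out. Fix: run the command below before using pip. Note: the command cannot be used as is; several parts must be replaced with your own environment's details. Replace every SERVER_ADDRESS with your server's address, replace YOUR_ACCOUNT and YOUR_PASSWORD with your own login, and change (X11; Ubuntu; Linux x86_64; rv:81.0) to match the server's Linux version. This information can usually be looked up with a command; if not, ask the server administrator.

curl 'http://SERVER_ADDRESS/0.htm' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2' --compressed -H 'Content-Type: application/x-www-form-urlencoded' -H 'Origin: http://SERVER_ADDRESS' -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Referer: http://SERVER_ADDRESS/0.htm' -H 'Upgrade-Insecure-Requests: 1' --data-raw 'R3=1&v6ip=&DDDDD=YOUR_ACCOUNT&upass=YOUR_PASSWORD&save_me=1&0MKKey=123'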

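Once the server is online, pip works normally. Many packages are hosted abroad, however, so downloads can be very slow; use a mirror instead. Also pay attention to the name a package goes by on pip. Taking OpenCV as an example:

pip install opencv-python -i https://pypi.tuna.tsinghua.edu.cn/simple

The -i https://pypi.tuna.tsinghua.edu.cn/simple after the package name selects the Tsinghua mirror.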
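
The point about package names can be sketched in shell: the name passed to pip (opencv-python) differs from the module name imported in Python (cv2). The helper below is a minimal sketch, not part of the original setup. install_if_missing and the PYTHON and PIP variables are names made up for illustration, and it assumes python3 and pip are on the PATH unless those variables say otherwise.

```shell
# Mirror URL taken from the article; swap in another mirror if you prefer.
MIRROR="https://pypi.tuna.tsinghua.edu.cn/simple"

# install_if_missing MODULE PACKAGE  (hypothetical helper)
#   MODULE  - name used in "import MODULE", e.g. cv2
#   PACKAGE - name used with pip, e.g. opencv-python
install_if_missing() {
    module="$1"
    package="$2"
    # PYTHON and PIP are assumed overrides for servers with several interpreters.
    if "${PYTHON:-python3}" -c "import $module" 2>/dev/null; then
        echo "$module is already importable"
    else
        "${PIP:-pip}" install "$package" -i "$MIRROR"
    fi
}
```

Calling install_if_missing cv2 opencv-python would then install OpenCV from the mirror only when import cv2 fails.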

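Problems when installing CUDA

Plenty of guides online explain how to install versions that match your Linux release and TensorFlow, so this section focuses on a cuDNN problem. Because of a mistake in my environment variables, some cuDNN libraries could not be loaded when GPU acceleration was used. In that situation the program fails with an error saying that a certain file cannot be found under a certain path. My fix was to search for the file and copy it into the path named in the error. The command for searching for a file:

locate filename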
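
locate relies on a prebuilt index, so it can miss recently installed files or be absent altogether. As a fallback, the search can be sketched with find, which walks the directory tree directly. find_file is a hypothetical helper name, and the default search root /usr is an assumption; adjust it to wherever CUDA lives on your server.

```shell
# find_file PATTERN [ROOT]  (hypothetical helper)
#   PATTERN - file name or glob, e.g. 'libcudnn*'
#   ROOT    - directory to search; /usr is an assumed default
find_file() {
    # Errors such as "Permission denied" are discarded so only matches print.
    find "${2:-/usr}" -name "$1" 2>/dev/null
}
```

For example, find_file 'libcudnn*' /usr/local lists every matching file under /usr/local. Instead of copying the file, adding its directory to LD_LIBRARY_PATH is another common option.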

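Training time limits

Servers generally enforce an idle limit: after roughly 30 minutes without any activity the connection is dropped automatically. That is too short for many models, so we need a way to extend the time or to prevent the disconnect altogether.

1. Script: on the web page, right-click -> Inspect -> Console and enter a script in the console to prevent the disconnect (it works by performing an action on the page at a fixed interval). Colab and Kaggle serve as examples here.

Script for Colab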

// Click Colab's connect button once a minute so the session counts as active.
function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click()
}
setInterval(ConnectButton,60000);

// Dismiss the sessions dialog once a minute.
function closeButton(){
    console.log("close");
    document.querySelector("body > colab-dialog > paper-dialog > colab-sessions-dialog").shadowRoot.querySelector("#footer > div > paper-button.dismiss").click()
}
setInterval(closeButton,60000);

Script for Kaggle

// Click the toolbar button once a minute.
function clickToolbarButton(){
    console.log("close");
    document.querySelector("#root > div > div > div.AppView-sc-16eb2j.kZXkZl > div.App_Body-sc-16c8j4p.hxOBfv > div.Layout_Body-sc-6piylv.bXAYPy > div > div > div > div.ToolbarContainer_Body-sc-2h8iu7.fhvgBU > button").click()
}
setInterval(clickToolbarButton,60000);

// Click the detailed-status button once a minute. It needs its own name:
// two functions both called closeButton would overwrite each other.
function clickStatusButton(){
    console.log("close");
    document.querySelector("#root > div > div > div.AppView-sc-16eb2j.kZXkZl > div.App_Body-sc-16c8j4p.hxOBfv > div.Layout_Body-sc-6piylv.bXAYPy > div > div > div > div.ToolbarContainer_Body-sc-2h8iu7.fhvgBU > div.DetailedStatus_Body-sc-zfwb95.fMzpPO > button > i").click()
}
setInterval(clickStatusButton,60000);

Disadvantage: as the scripts above show, every website needs its own code, which is tedious. The nohup command avoids this by letting the job keep running without an attached session.

nohup command: run nohup on the Xshell or ssh command line. Put nohup in front of the command you want to execute and it keeps running after you disconnect; running a Python file works the same way. A command started this way prints nothing to the screen: the output it would otherwise print is written to a nohup.out file created in the current folder. Once the command is running under nohup, you can close Xshell or even shut down your own computer, and the server keeps running the job.
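
As a minimal sketch of that workflow, the helper below starts a command under nohup, sends its output to a log file of your choosing instead of nohup.out, and prints the process ID so the job can be found again later. run_detached is a hypothetical name, and train.py in the usage note stands in for your own training script.

```shell
# run_detached LOG_FILE COMMAND [ARGS...]  (hypothetical helper)
run_detached() {
    log_file="$1"
    shift
    # Append stdout and stderr to the log; the trailing '&' puts the job
    # in the background so the shell prompt returns immediately.
    nohup "$@" >> "$log_file" 2>&1 &
    # $! holds the PID of the background job that was just started.
    echo "$!"
}
```

A call such as run_detached train.log python train.py prints the PID, and tail -f train.log follows the output while you are still connected.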

nohup.out takes up too much disk space

A long training run writes a great deal of text to nohup.out, which can take up a lot of disk space. My approach is to delete nohup.out once I have confirmed that the job has started; the job keeps running normally. This works on Linux because the process holds the file open, so it can keep writing after the name is removed. The space those writes use is only given back when the process exits, though, so deleting the file hides the growth rather than stopping it.
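
One way to keep the log bounded is to empty it in place rather than delete it. The sketch below assumes the job writes its log in append mode (nohup.out is opened that way, as is any log redirected with >>). trim_log and the 100 MiB default are made-up illustrations, not something from the original setup.

```shell
# trim_log LOG_FILE [MAX_BYTES]  (hypothetical helper)
trim_log() {
    log_file="$1"
    max_bytes="${2:-104857600}"   # assumed default: 100 MiB
    # wc -c prints the size in bytes; $(( )) strips any padding spaces.
    size=$(( $(wc -c < "$log_file") ))
    if [ "$size" -gt "$max_bytes" ]; then
        # ': > file' truncates the file to zero bytes without removing it.
        : > "$log_file"
    fi
}
```

Truncating frees the space immediately. Append mode matters because a writer that opened the file without it would continue at its old offset and leave a gap at the start of the file.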

Kill a nohup process

First find the process with this command:

ps aux | less

Output: note that the second column is the process ID (PID) and the last column is the command (COMMAND). For a Python program the command looks like: python program_name arguments. Then kill the process:

kill -9 PID
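
kill -9 ends the process at once, leaving it no chance to save a checkpoint or flush its log. A gentler variant, sketched below, sends the default TERM signal first and only falls back to -9 if the process is still alive after a grace period. stop_process and the 10-second default are assumptions made for illustration.

```shell
# stop_process PID [WAIT_SECONDS]  (hypothetical helper)
stop_process() {
    pid="$1"
    wait_secs="${2:-10}"          # assumed grace period
    # Plain kill sends SIGTERM; if that fails the process is already gone
    # (or is not ours to signal), so there is nothing more to do.
    kill "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$wait_secs" ]; do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still running after the grace period: force it to stop.
    kill -9 "$pid" 2>/dev/null
    return 0
}
```

With the PID read from the ps output, stop_process 12345 would stop that job (12345 is a placeholder). On most Linux systems pgrep -f train.py is a shortcut for finding the PID of a script by name.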


Origin juejin.im/post/7078464404924137503