8. The Cluster Manager (ClusterManager)
This section is specially devoted to the cluster manager. Like `AbstractArray`, the cluster manager is an abstract type.
An abstract type in Julia is essentially a "template": concrete subtypes derive from it, each adding its own internal fields. Both abstract arrays and cluster managers work this way, so whenever `ClusterManager` appears below, remember that it refers to the abstract type, and don't be confused.
In Julia, the networked structure formed by the master process and its Workers is called a "cluster", whether it runs on one machine or on many; the concrete managers that run it (such as `LocalManager` and `SSHManager`) are subtypes of the abstract `ClusterManager`. We are already familiar with the master process and the Workers: the master process always has PID = 1, and only on it can other processes be added or removed. In this sense Julia's cluster management is one-way, and the cluster manager acts as the master process's assistant. Communication, however, is equal: any two processes can talk to each other.
A cluster can be created on a single machine or across multiple machines. To avoid confusion, we call the former a "single-machine cluster" and the latter a "multi-machine cluster". The single-machine case roughly amounts to virtualizing one machine into several, so it is also called a "virtual cluster".
Julia's cluster manager provides a set of functions and commands for managing both kinds of cluster. For single-machine clusters these summarize the process operations we already know; for multi-machine clusters the operations are the same, with a few extra commands added for managing multiple machines.
A cluster manager has three responsibilities:
- start Workers in the cluster environment;
- manage each Worker's life cycle, e.g. sending interrupt signals;
- (optionally) provide data transport.
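These responsibilities map onto a small interface: a concrete manager subtypes `ClusterManager` and extends `Distributed.launch` (start Workers) and `Distributed.manage` (life-cycle events). A minimal sketch of that interface; the `ToyManager` name and its field are invented for illustration, and the method bodies are stubs, not a working manager:

```julia
using Distributed

# Hypothetical skeleton of a custom cluster manager.
struct ToyManager <: ClusterManager
    np::Int   # how many Workers this manager should launch
end

# Responsibility 1: start Workers. A real implementation would spawn
# `m.np` worker processes, push a Distributed.WorkerConfig for each
# onto `launched`, and notify(c) as configs become available.
function Distributed.launch(m::ToyManager, params::Dict,
                            launched::Array, c::Base.Condition)
    notify(c)
end

# Responsibility 2: manage the Worker life cycle. `op` is one of
# :register, :interrupt, :deregister, :finalize.
function Distributed.manage(m::ToyManager, id::Integer,
                            config::Distributed.WorkerConfig, op::Symbol)
end
```

With these two methods filled in, `addprocs(ToyManager(4))` would drive the whole handshake described below.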
Connections between processes in a multi-machine cluster use the built-in TCP/IP transport. Internally, the connection process works as follows:
- `addprocs` is called on the master process with a `ClusterManager` object; it invokes the appropriate method to start the required number of Workers on the appropriate machines.
- Each Worker listens on a free port and writes its own host and port information to its `stdout`.
- The cluster manager reads each Worker's `stdout` and passes the information to the master process.
- The master process parses the information and sets up a TCP/IP connection to each Worker.
- Every process in the cluster is then informed of the other processes' connection information.
- Each process connects to all processes with smaller PIDs, forming a pairwise (all-to-all) network.
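The result of this handshake is easy to check. A small single-machine sketch (run in a fresh session):

```julia
using Distributed

addprocs(2)                        # master (PID 1) plus two Workers
println(procs())                   # → [1, 2, 3]

# Communication is equal: the master can ask Worker 2 for its own PID.
println(remotecall_fetch(myid, 2)) # → 2
```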
All of these steps are implicit; the only explicit operation the user performs is `addprocs(args)`, which splits into three cases:
- If the argument is an integer `n`, a single-machine cluster with `n` Workers is built. With no argument the default is `n = Sys.CPU_THREADS`, so the total number of processes equals the number of CPU logical threads plus 1 (the master).
- If the argument is a vector of hostnames, a multi-machine cluster is built. The official standard-library documentation writes this as `addprocs(machines)`, where `machines` is a vector of "machine parameters", each of which starts Workers on one machine. A machine parameter is either a string `machine_spec` or a tuple `(machine_spec, count)`. The string has the format `machine_spec = [user@]hostname[:port] [bind_addr[:port]]`, where `[]` marks parts that may be omitted. Only `hostname` is mandatory; `user` defaults to the current user and `port` to the standard SSH port. Together they give the master process the information it needs to connect to the target Worker. If `bind_addr[:port]` is also written, the other Workers will connect to this Worker through that custom additional address and port. The tuple's second element `count` is an integer saying how many Workers to create on the machine; with `count = :auto` (note the colon), as many Workers are created as the machine has CPU logical threads.
- If the argument is a `manager::ClusterManager`, a cluster is created with that custom cluster manager. The example given in the official documentation is the `ClusterManagers.jl` package, which builds a so-called "Beowulf cluster" through a custom ClusterManager. For how to write one yourself, I'm afraid you will have to read the source code.
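The three call forms side by side. The hostnames below are hypothetical, and the multi-machine lines are commented out because they require passwordless SSH access to real machines:

```julia
using Distributed

# Case 1: integer argument → single-machine ("virtual") cluster.
addprocs(2)
println(nprocs())   # → 3 (two Workers plus the master)

# Case 2: vector of machine parameters → multi-machine cluster.
# addprocs(["user1@host1", ("host2:2222", 4), ("host3", :auto)])

# Case 3: a custom manager, e.g. one from the ClusterManagers.jl package.
# addprocs(manager)
```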
Tip: you can run the `hostname` command in a Windows or Linux terminal to see the machine's hostname.
Finally, once the cluster is created, every process can be operated on with the commands from the previous sections, whether the cluster is single-machine or multi-machine.
There is also a `--machine-file` option that connects to each machine when Julia starts. Usage: type in the terminal:
julia --machine-file <path>
where `<path>` is the path to a file you create yourself, containing the hostname or IP address of each machine, one machine per line. For example, put a file named `machinefile` in your home directory and then run:
julia --machine-file ~/machinefile
where the `machinefile` file contains the hostnames of two computers:
host1
host2
Now we have two ways to create a multi-machine cluster: `addprocs(machines)` and `--machine-file`. I once read a blog post complaining about the latter; the gist was roughly this:
After `--machine-file` connects the machines, Workers are started automatically on each listed machine (the file format allows a per-machine count of the form `count*[user@]host[:port]`). Some extra custom information can be added, but the degree of freedom is low: for example, the network topology and the location of the Julia executable cannot be customized. For full control over the cluster-creation process, use `addprocs(machines)`. The specific approach:
- First create a file `startupfile.jl` and write the `addprocs(machines)` call in it.
- Then run `julia -L startupfile.jl` in the terminal; `startupfile.jl` is executed as soon as the Julia REPL starts.
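A minimal sketch of such a `startupfile.jl`. The remote line is commented out, and its hostnames are hypothetical (passwordless SSH to them is assumed), so the local line lets the sketch run on a single machine:

```julia
# startupfile.jl — loaded by `julia -L startupfile.jl` before the REPL appears.
using Distributed

addprocs(2)   # local Workers, so the sketch also runs on a single machine

# Multi-machine version:
# addprocs([("host1", 24), ("host2", 12), ("host3", 8)]; topology=:master_worker)
```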
Compared with calling `addprocs()` by hand every time, or writing `addprocs()` at the top of each script, this approach is clearly much easier when several programs need to run in parallel, and in `startupfile.jl` the cluster parameters can be tuned finely. A detailed demonstration follows.
Suppose a cluster contains 4 servers. Besides the main server, the three remote servers are:
- host1, 24 cores (strictly speaking these are logical threads; below, "core" means "logical thread")
- host2, 12 cores
- host3, 8 cores
The `machinefile` then reads:
host1
host2
host3
If the Workers described in the file are exactly what you need, it is as simple as `julia --machine-file ~/machinefile`. Facing a supercomputer, or a very large cluster made up of many servers, you can try generating the `machinefile` file automatically; for specific ways to do that, consult MPI users.
However, to build a cluster with the `:master_worker` topology, in which Workers communicate only with the master process and not with each other (as many clusters are set up), you can write a `startupfile.jl` along these lines:
- First start the Workers on the main server, e.g. `addprocs(4)`. This step is independent of the next; if you need no Workers on the main server you can even omit it. (Of course, you usually don't want the main server to sit idle.)
- Then add the 3 remote servers to the cluster:
for host in ["host1", "host2", "host3"]
    addprocs([host]; topology=:master_worker)  # note the semicolon and the colon
end
# or, equivalently
addprocs(["host1", "host2", "host3"]; topology=:master_worker)
This creates one Worker on each server; to create as many Workers as a server has cores, write `(hostname, :auto)` instead, and to fix the number exactly, change each `hostname` to `(hostname, count)`:
addprocs([("host1", 24), ("host2", 12), ("host3", 8)]; topology=:master_worker)
If there are many servers, writing out all the hostnames is tedious. You can first put them all in a `machinefile`, or export one automatically with the MPI users' method (if the exported file contains only hostnames and no Worker counts, fill the counts in yourself; a bare hostname yields the default count), and then read it line by line (`expanduser` expands the `~`):
addprocs(readlines(expanduser("~/machinefile")); topology=:master_worker)
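If the file mixes bare hostnames with "hostname count" lines, a small helper can turn it into the machine parameters `addprocs` accepts. `parse_machinefile` is a hypothetical function written here for illustration, not part of Julia:

```julia
# Hypothetical helper: turn machinefile lines into addprocs machine parameters.
#   "host"     → "host"        (default Worker count)
#   "host 12"  → ("host", 12)  (explicit Worker count)
function parse_machinefile(lines)
    specs = Any[]
    for line in lines
        parts = split(strip(line))
        isempty(parts) && continue                      # skip blank lines
        if length(parts) == 1
            push!(specs, String(parts[1]))
        else
            push!(specs, (String(parts[1]), parse(Int, parts[2])))
        end
    end
    return specs
end

parse_machinefile(["host1 24", "host2 12", "host3"])
# → Any[("host1", 24), ("host2", 12), "host3"]
```

It would then be used as, e.g., `addprocs(parse_machinefile(readlines(expanduser("~/machinefile"))); topology=:master_worker)`.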
`addprocs()` also accepts several other keyword parameters; see the official documentation for details.
In short, the `julia -L startupfile.jl` approach defines clusters finely and conveniently, and is well worth recommending.