Julia parallel computing notes (5)

8. Cluster manager (ClusterManager)

This section is added specifically to introduce the cluster manager. ClusterManager is an abstract type, much like AbstractArray.

Abstract types in Julia act like "templates": from such a template you can derive many concrete subtypes, each adding its own fields. Both AbstractArray and ClusterManager are abstract types. When ClusterManager is mentioned below, remember that it is an abstract type, so don't be confused.

In Julia, the networked structure formed by the master process and its workers is called a "cluster"; it is launched and controlled through an instance of a ClusterManager subtype, whether it runs on a single machine or on several. We are already familiar with the master process and the workers. The master process always has PID = 1, and only on this process can other processes be added or removed. In this sense Julia's cluster management is one-way, and the cluster manager acts as the master process's assistant. Nevertheless, any two processes can communicate with each other, so communication itself is symmetric.
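
For instance, the two managers that ship with the Distributed standard library, LocalManager (used when addprocs is called with an integer) and SSHManager (used when it is called with a list of machines), are both concrete subtypes of ClusterManager. A quick REPL check (a sketch, assuming Julia 1.x, where these types live in the Distributed module):

using Distributed
supertype(Distributed.LocalManager)   # ClusterManager; the manager behind addprocs(n::Integer)
supertype(Distributed.SSHManager)     # ClusterManager; the manager behind addprocs(machines)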

The cluster can be created on a single machine or across multiple machines. To avoid confusion, we call the former a "single-machine cluster" and the latter a "multi-machine cluster". The single-machine case essentially virtualizes one machine into several, so it is also called a "virtual cluster".

Julia's cluster manager comes with a set of functions and commands for managing both single-machine and multi-machine clusters. For the former, these are just the process operations we already know; for the latter, the operations are the same, plus a few extra commands for handling multiple machines.

A cluster manager has three responsibilities (a small sketch follows this list):

  1. Start workers in the cluster environment.
  2. Manage the life cycle of each worker, for example by sending interrupt signals.
  3. (Optionally) provide data transport between processes.
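
For example, responsibilities 1 and 2 look like this in practice, using the built-in local manager (a minimal sketch; run it from the master process):

using Distributed
pids = addprocs(2)     # responsibility 1: start two workers (via the built-in LocalManager)
interrupt(pids)        # responsibility 2: life-cycle management, e.g. send an interrupt (Ctrl-C) signal
rmprocs(pids)          # remove the workers from the cluster when they are no longer needed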

The connection between processes in a multi-machine cluster is based on the built-in TCP/IP transport. Internally, the connection procedure works as follows (a quick check of the result follows the list):

  • First, addprocs is called on the master process with a ClusterManager object.
  • addprocs calls an appropriate method to start the required number of workers on the appropriate machines.
  • Each worker starts listening on a free port and writes its own host and port information to stdout.
  • The cluster manager reads each worker's stdout and passes the information on to the master process.
  • The master process parses the information and sets up a TCP/IP connection to each worker.
  • Every process in the cluster is then informed of the connection information of the other processes.
  • Each process connects to all processes with a smaller PID, forming a pairwise (all-to-all) network.
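
The result can be checked from the REPL (a sketch, assuming the default all-to-all topology):

using Distributed
addprocs(2)                    # two local workers, PIDs 2 and 3
procs()                        # [1, 2, 3] as seen by the master
remotecall_fetch(procs, 2)     # the same list, queried on worker 2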

These steps are all implicit; the only explicit operation the user performs is addprocs(args), which falls into three cases (see the sketch after this list):

  • If the argument is an integer n, a single-machine cluster with n workers is built. If the argument is omitted, it defaults to n = Sys.CPU_THREADS, so the total number of processes equals the number of logical CPU threads + 1.
  • If the argument is an array of hostnames, one entry per machine, a multi-machine cluster is built. The official standard-library documentation writes this as addprocs(machines), where machines is a vector of "machine parameters", each of which starts workers on one machine. A machine parameter is either a string machine_spec or a tuple (machine_spec, count). The string has the form machine_spec = [user@]hostname[:port] [bind_addr[:port]], where the brackets [] mark parts that may be omitted. Only hostname is mandatory; user defaults to the current user and port to the standard SSH port. Together they give the master process the information it needs to connect to the target worker. If bind_addr[:port] is also written, the other workers will connect to this worker through bind_addr[:port]; in other words, it is a custom additional address and port. The second tuple element count is an integer specifying how many workers to create on that machine. If you write count = :auto (note the colon), as many workers are created as the machine has logical CPU threads.
  • If the argument is manager::ClusterManager, a custom cluster is created, where manager is an instance of a custom ClusterManager. The example given in the official documentation is the extension package ClusterManagers.jl, through whose custom ClusterManager a so-called "Beowulf cluster" can be built. As for how to write such a custom manager, I'm afraid you have to read the source code.
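
A small sketch of the three call forms (the hostnames and the user name are placeholders, and the third case is only indicated in a comment):

using Distributed

addprocs(4)            # case 1: 4 workers on the local machine
addprocs()             # case 1: no argument, defaults to Sys.CPU_THREADS workers

# case 2: SSH workers on remote machines
addprocs(["host1", ("user1@host2:22", 2), ("host3", :auto)])

# case 3: pass an instance of a custom ClusterManager,
# e.g. one of the managers provided by the ClusterManagers.jl package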

Tip: you can run the hostname command in a Windows or Linux terminal to see the machine's hostname.

Finally, once the cluster has been created, you can operate on each process with the commands from the previous sections, whether the cluster spans one machine or several.

There is also a --machine-file option (spelled --machinefile in Julia 0.x) for connecting to each machine when the Julia REPL is started. Usage: type in the terminal:

julia --machine-file <path-to-file>

Here <path-to-file> is the path to a file you create yourself, whose content is the hostname or IP address of each machine, one machine per line. For example, create a file named machinefile (below it is assumed to sit in the home directory) and then run:

julia --machine-file ~/machinefile

where the machinefile contains the hostnames of two computers:

host1
host2
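
For the record, a machinefile line is not limited to a bare hostname. According to the manual, each line has the form [count*][user@]host[:port] [bind_addr[:port]], where count is the number of workers to start on that host and defaults to 1. A small sketch (the user name and hostnames are placeholders):

4*host1
2*user2@host2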

So now we have two ways to create a multi-machine cluster: addprocs(machines) and --machine-file. I once read a blog post complaining about the latter; the argument goes roughly like this:

After --machine-file has connected the machines, workers are started on them automatically according to the lines of the file. Some extra per-machine information can be written into the file, but the degree of freedom is low: for example, the network topology and the location of the Julia executable cannot be customized. If you want full control over how the cluster is created, you should use addprocs(machines). The concrete approach is:

  1. First create a file startupfile.jl and write the addprocs(machines) call into it (a minimal sketch follows this list).
  2. Start Julia from the terminal with julia -L startupfile.jl; startupfile.jl is then executed as soon as the Julia REPL starts.
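
A minimal startupfile.jl might look like the following sketch (the hostnames are placeholders; note that in Julia 1.x the Distributed standard library must be loaded before addprocs can be called):

# startupfile.jl
using Distributed                        # required in Julia 1.x
addprocs([("host1", 4), ("host2", 4)])   # four workers on each of two remote hosts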

Compared with calling addprocs() by hand every time, or writing addprocs() at the top of every piece of code, this approach is clearly much more convenient when several programs need to run in parallel. Inside startupfile.jl we can tune the cluster parameters finely. A detailed demonstration follows:

Suppose a cluster contains 4 servers. Besides the main server, there are three remote servers:

  • host1, 24 cores (strictly speaking logical threads; below, "core" is used to mean "logical thread")
  • host2, 12 cores
  • host3, 8 cores

The machinefile then reads:

host1
host2
host3

If one worker per machine were enough, julia --machine-file ~/machinefile would already be the straightforward choice with this file. Since here we want as many workers as cores, prefix each line with a count (24*host1, 12*host2, 8*host3) and run the same command. If you are facing a supercomputer or a very large cluster made up of many servers, you can try to generate the machinefile automatically; ask MPI users for the specific way to generate one.

However, if you want to build a cluster with the :master_worker topology, so that all workers can communicate only with the master process and not with each other (many clusters are run this way), you can write a startupfile.jl with the following content:

  • First start the workers located on the main server, for example with addprocs(4). Note that this step is independent of the next one; if you don't need workers on the main server you can even skip it. (Of course, you usually don't want the main server to sit idle.)
  • Add 3 remote servers to the cluster:
for host in ["host1", "host2", "host3"]
    addprocs([host]; topology=:master_worker)  # note the semicolon and the colon; addprocs expects a vector of machine specs
end
# or, equivalently, in one call
addprocs(["host1", "host2", "host3"]; topology=:master_worker)

Note that a bare hostname string starts only one worker on that machine by default; to create as many workers as a machine has cores, use (hostname, :auto). You can also specify the number of workers explicitly, by changing hostname to (hostname, count):

addprocs([("host1", 24), ("host2", 12), ("host3", 8)]; topology=:master_slave)

If there are many servers, writing out all those hostnames is tedious. You can first put all the hostnames into a machinefile, or export one automatically with the MPI users' method, and then read the file line by line. Keep in mind that a line carrying only a hostname yields a single worker on that host; a per-core variant is sketched right after the following call:

addprocs(collect(eachline(expanduser("~/machinefile"))); topology=:master_worker)  # expanduser expands the ~ into the home directory
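
If the file contains only bare hostnames but you want one worker per core on every host, a variant that attaches :auto to each entry (a sketch under the same file layout):

using Distributed
hosts = collect(eachline(expanduser("~/machinefile")))
addprocs([(h, :auto) for h in hosts]; topology=:master_worker)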

In addition, addprocs() takes several further keyword arguments; see the official documentation for details.
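
For example, a few commonly used keyword arguments look like this (a sketch only; the hostname, paths and flags are placeholders):

using Distributed
addprocs([("host1", 4)];
         dir      = "/tmp",                   # working directory on the remote host
         exename  = "/usr/local/bin/julia",   # path to the julia executable on the remote host
         exeflags = `--project=.`,            # extra flags passed to the remote julia processes
         tunnel   = true,                     # route worker connections through an SSH tunnel
         topology = :master_worker)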

In short, the julia -L startupfile.jl approach lets you define clusters finely and conveniently, and is well worth recommending.

Source: blog.csdn.net/iamzhtr/article/details/91947784