After writing five posts on the theory, I tried to put it into practice and ran into a number of problems. This article is a record of, and reflection on, the problems encountered in practice. The code has been simplified for readability, but the principles are unchanged. The article is updated from time to time, with the newest entries first; if it grows too long it will be split into multiple articles.
For brevity, objects broadcast with @everywhere are called "broadcast objects": broadcast variables, broadcast functions, and so on. Readers should already understand the difference between "sharing" and "broadcasting". Also note that the first run under @time is slower because it includes compilation, so run the code several times.
Jun 24, 2019
Testing today showed that referencing an array of structs inside a parallelized loop body increases the run time by about two orders of magnitude. The conclusion: avoid arrays of structs in parallel programs.
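The layout difference at issue can be illustrated serially with a toy struct (the Particle type and its fields are hypothetical, not from the original test; the timing gap itself only shows up in the parallel loop):

```julia
# Hypothetical names for illustration only.
struct Particle
    x::Float64
    v::Float64
end

n = 1000
particles = [Particle(i, 2i) for i in 1:n]   # array of structs
xs = Float64.(1:n)                           # the same data as plain arrays
vs = 2.0 .* (1:n)

# The same reduction in both layouts gives the same answer;
# inside a @distributed loop, the plain-array layout is the one to prefer.
s_structs = sum(p.x + p.v for p in particles)
s_arrays  = sum(xs) + sum(vs)
```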
Jun 23, 2019
I want to do the following: create a shared array W and modify it in parallel inside a function f. The code is as follows:
using Distributed
using SharedArrays

addprocs(4 - nprocs())
println("Running ", nprocs(), " processes")

t = 2.0; nx = 1000; ny = 1000
W = SharedArray{Float64}((nx, ny), init = A -> fill!(A, 0.0))

function f()
    @time @sync @distributed for i = 1:nx
        for j = 1:ny
            W[i, j] = W[i, j] + t
        end
    end
end

f()
rmprocs(2, 3, 4)
Let's explain this code line by line.
addprocs(4 - nprocs())
By default nprocs() == 1, so this starts 3 remote workers. If you open the task manager at this point you will see: the first Julia process is the resident terminal and does not participate in computation; the second is the main process, which runs whenever there is a computing task, parallel or not; the remaining three are the remote workers.
t = 2.0; nx = 1000; ny = 1000
Declares three variables; the first is written with a decimal point so that Julia automatically treats it as a floating-point number.
W = SharedArray{Float64}((nx, ny), init = A -> fill!(A, 0.0))
Declares a shared array and initializes it to zero (note that the init function must write into A, for example with fill!; rebinding A to a new array would be a no-op). The array is stored on the main process by default, so in the subsequent parallel computation, watching the task manager shows the main process finishing faster than the three remote workers.
function f()
    @time @sync @distributed for i = 1:nx
        for j = 1:ny
            W[i, j] = W[i, j] + t
        end
    end
end
A few points about this function:
f can take no parameters at all; every variable and array is automatically captured from the surrounding scope into the function. If a parameter is declared, that parameter follows the normal argument-passing rules, while the rest are still captured automatically. For example, amended to:
function f(nx)
    for i = 1:nx
        for j = 1:ny
            W[i, j] = W[i, j] + t
        end
    end
end

W1 = f(nx)
W2 = f(nx + 1)
You will see that W1 = f(nx) runs normally, while W2 = f(nx + 1) throws an error, because i then runs past the first dimension of W.
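The failure can be reproduced without any workers at all; a minimal serial sketch (the helper g is hypothetical, but mirrors the snippet above):

```julia
# Serial sketch of the out-of-bounds failure.
nx, ny = 3, 2
t = 2.0
W = zeros(nx, ny)

function g(n)
    for i = 1:n
        for j = 1:ny
            W[i, j] = W[i, j] + t   # i > nx throws a BoundsError
        end
    end
end

g(nx)                 # in bounds: runs normally
err = try
    g(nx + 1)         # i reaches nx + 1, past the first dimension of W
    nothing
catch e
    e
end
```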
- Now go back to the original code.
@distributed has already been introduced. Julia's multi-level for loop can be abbreviated in the form for i=1:m, j=1:n, k=1:p, but @distributed only recognizes (and parallelizes) the outermost level, so the outermost and inner levels must be separated and written as:
@distributed for i = 1:m
    for j = 1:n, k = 1:p
        <Expr>
    end
end
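The split form visits exactly the same index tuples as the abbreviated form, which can be checked serially (array names here are hypothetical):

```julia
# Quick check that the abbreviated form and the split form
# visit the same (i, j, k) triples in the same order.
m, n, p = 2, 3, 2
triples_a = NTuple{3,Int}[]
triples_b = NTuple{3,Int}[]

for i = 1:m, j = 1:n, k = 1:p      # abbreviated multi-level form
    push!(triples_a, (i, j, k))
end

for i = 1:m                        # split form: only this outermost
    for j = 1:n, k = 1:p           # level is what @distributed parallelizes
        push!(triples_b, (i, j, k))
    end
end
```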
Since Julia stores arrays column by column, traversing the first index i is significantly faster than traversing the other indices. It is therefore best to put index i in the innermost loop and distribute over the other indices first, keeping the traversal over i contiguous. Changed to:
function f()
    @time @sync @distributed for j = 1:ny
        for i = 1:nx
            W[i, j] = W[i, j] + t
        end
    end
end

f()
The original code takes 0.316919 seconds (185.12 k allocations: 9.060 MiB); after the modification it takes 0.262387 seconds (185.10 k allocations: 9.058 MiB). Memory consumption is almost identical, but the modified version is faster.
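The cache-friendly ordering can be sketched serially (the helper add_t! is a hypothetical name):

```julia
# Column-major access: for a fixed j, the elements W[1,j], W[2,j], ...
# are contiguous in memory, so i belongs in the innermost loop.
function add_t!(W, t)
    for j in axes(W, 2)        # outer loop over columns (what @distributed splits)
        for i in axes(W, 1)    # inner loop over the contiguous first index
            W[i, j] += t
        end
    end
    return W
end

W = zeros(3, 2)
add_t!(W, 2.0)                 # every element becomes 2.0
```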
- When designing a parallel program, it is natural to ask which variables should be broadcast. The example above shows that the variables nx, ny, t, appearing in three different positions of the @distributed structure, do not need to be broadcast; in general, variables inside @distributed that are not modified by multiple processes need no broadcasting. But what if you do want to modify a variable: should it be broadcast? Consider the following example:
function f()
    @time @sync @distributed for j = 1:ny
        for i = 1:nx
            @everywhere t += 1
            W[i, j] = W[i, j] + t
        end
    end
end

f()
If you remove the @everywhere, an error is reported, which proves that broadcasting does work here. However, the computation time grows enormously, because @everywhere is a remote-call command and executing it repeatedly is expensive. If you instead move it to the front, like this:
@everywhere t = 2

function f()
    @time @sync @distributed for j = 1:ny
        for i = 1:nx
            t += 1
            W[i, j] = W[i, j] + t
        end
    end
end

f()
The system reports an error. So what is the reasonable approach? The answer is to turn t into a parameter, like this:
function f(t)
    @time @sync @distributed for j = 1:ny
        for i = 1:nx
            t += 1
            W[i, j] = W[i, j] + t
        end
    end
end

f(t)
You will see that the computation time barely increases, and this approach requires no broadcasting. But!!! When you print W, you will find that the result has changed. Suppose the intent is to modify t once and add it to W uniformly; taking nx = 3; ny = 2 as an example, the intended W is:
julia> W
3×2 SharedArray{Float64,2}:
3.0 3.0
3.0 3.0
3.0 3.0
If you use the parameter-passing method above, you will instead get:
julia> W
3×2 SharedArray{Float64,2}:
3.0 3.0
4.0 4.0
5.0 5.0
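This pattern can be reproduced serially by giving each column its own copy of t, which is what @distributed's serialization of the captured t effectively does (a sketch under the assumption that each worker handles exactly one column; with more columns than workers, t would also accumulate across a worker's columns):

```julia
# Serial reproduction of the parameter-passing result.
nx, ny = 3, 2
W = zeros(nx, ny)

for j = 1:ny
    t = 2.0                    # fresh copy per column, like per worker
    for i = 1:nx
        t += 1
        W[i, j] = W[i, j] + t  # t accumulates along i within the column
    end
end
```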
That is, the modification of t accumulates along the i dimension. The @everywhere t += 1 method, meanwhile, gives:
julia> W
3×2 SharedArray{Float64,2}:
4.0 4.0
6.0 6.0
8.0 8.0
This is even more extreme: the modification of t accumulates along i and, on top of that, accumulates again along j. Where is the problem? Clearly, the expression is in the wrong position. Change it to the following:
function f2(t)
    @time @sync @distributed for j = 1:ny
        t += 1
        for i = 1:nx
            W[i, j] = W[i, j] + t
        end
    end
end
This gives the correct result. The conclusion: the iterations of the outermost loop of a @distributed structure are separated and independent of each other, so modifications to a variable do not accumulate across them; the inner loops, however, still follow the ordinary looping rules, and modifications there do accumulate. Keep this in mind when using @distributed.
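Serially, f2's behavior corresponds to incrementing t once per outer iteration, before the inner loop (again a sketch under the one-column-per-worker assumption):

```julia
# Serial sketch of f2: t changes once at the outermost level,
# so it does not accumulate across outer iterations.
nx, ny = 3, 2
W = zeros(nx, ny)

for j = 1:ny
    t = 2.0
    t += 1                     # once per outer iteration, before the inner loop
    for i = 1:nx
        W[i, j] = W[i, j] + t
    end
end
```

Every element of W ends up equal to 3.0, matching the intended result above.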
@sync ensures that all processes have completed their tasks before execution continues past the loop (so that @time reports the full duration). Without @sync, @time quickly returns a result such as 0.012252 seconds (8.00 k allocations: 421.867 KiB) while the task manager shows every process still computing; the reported time is only the cost of launching the remote calls. Making nx and ny larger makes this even more obvious. If you print W while some processes have not yet finished, you will see that some of its elements are still unchanged. Since the shared array is stored on the main process by default, the main process generally finishes first, and the remaining processes finish almost simultaneously.
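The same waiting behavior can be seen with plain tasks, no workers needed (a small sketch; the results array is a hypothetical stand-in for W):

```julia
# @sync blocks until every @async task started inside it has finished;
# without it, execution would continue before the tasks complete.
results = zeros(Int, 4)

@sync for k = 1:4
    @async begin
        sleep(0.01 * k)        # stand-in for remote work
        results[k] = k^2
    end
end
# after @sync returns, every slot has been written
```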
rmprocs(2, 3, 4)
Finally, the extra worker processes must be closed like this; otherwise, when the code is executed again the system adds new memory on top of the existing processes, doubling the memory occupied, and a few more runs will overwhelm the machine. But remember not to try to close the main process, or the system will refuse to execute the rmprocs() command.