Julia High Performance Computing Practice Record (1)

After writing five blog posts on the theory, I tried to put it into practice and ran into some problems. This article is a record of, and reflection on, the problems encountered in practice. The code has been simplified for readability, but the principles are unchanged. I will update it from time to time, with the date and content of the newest update listed first; if it grows too long it will be split into multiple articles.

For brevity, objects defined with @everywhere are referred to as "broadcast objects", e.g. broadcast variables, broadcast functions, and so on. Readers should already understand the difference between "sharing" and "broadcasting". Also note that the first run under @time is slower because of compilation, so run each measurement several times.

Jun 24, 2019

After testing today, I found that referencing an array of structs inside a parallelized loop body increases the running time by two orders of magnitude. The conclusion: avoid arrays of structs in parallel programs.
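As a sketch of one workaround (the Point struct and field names here are hypothetical, not from the benchmark above): unpack the array of structs into plain numeric arrays before the parallel loop, so the loop body indexes flat arrays instead of struct fields.

```julia
# Hypothetical example: unpack an array of structs into plain
# arrays ("struct of arrays") before entering a parallel loop.
struct Point
    x::Float64
    y::Float64
end

pts = [Point(i, 2i) for i in 1:1000]

# Extract flat Float64 arrays once, up front; these can then be
# wrapped in SharedArrays or captured by @distributed cheaply.
xs = [p.x for p in pts]
ys = [p.y for p in pts]

# The (here serial) loop body now indexes plain arrays only.
function total(xs, ys)
    s = 0.0
    for i in eachindex(xs)
        s += xs[i] + ys[i]
    end
    return s
end
```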

Jun 23, 2019

I want to do the following: create a shared array W and modify it in parallel inside a function f. The code is shown below:

using Distributed
using SharedArrays

addprocs(4-nprocs())
println("Running ",nprocs()," processes")

t = 2. ; nx=1000; ny=1000
W = SharedArray{Float64}((nx,ny), init = A -> fill!(A, 0.0))


function f()
    @time @sync @distributed  for i=1:nx
        for j=1:ny
            W[i,j] = W[i,j] + t
        end
    end
end

f()

rmprocs(2,3,4)

Let's explain this code line by line.

addprocs(4-nprocs())

By default nprocs() == 1, so this starts 3 remote workers. If you open the task manager at this point (screenshot omitted), you will see several Julia processes: the first is a resident terminal and does not participate in the computation; the second is the main process, which runs whenever there is a computing task, parallel or not; the remaining three are the remote workers.

t = 2. ; nx=1000; ny=1000

This declares three variables. The first has a trailing decimal point so that Julia automatically infers it as a floating-point number.
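A trivial check confirms what the trailing dot does to the inferred type (the exact integer width depends on the platform, typically Int64 on 64-bit systems):

```julia
t_int   = 2     # no decimal point → an Int
t_float = 2.    # trailing decimal point → Float64
```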

W = SharedArray{Float64}((nx,ny), init = A -> fill!(A, 0.0))

This declares a shared array initialized to zero. By default it is stored on the main process, so during the subsequent parallel computation, watching the task manager will show the main process finishing faster than the three remote workers (screenshot omitted).

function f()
    @time @sync @distributed  for i=1:nx
        for j=1:ny
            W[i,j] = W[i,j] + t
        end
    end
end

There are several points to explain about this function:

  • f can take no parameters: all outer variables and arrays are automatically captured into the function's local scope. If it does take a parameter, that argument follows the normal parameter-passing rules, while everything else is still captured automatically. For example, if it is amended to:
function f(nx)
   for i=1:nx
       for j=1:ny
           W[i,j] = W[i,j] + t
       end
   end
end

W1 = f(nx)
W2 = f(nx+1)

You will see that the W1 call runs normally, while the W2 call throws an error (i = nx + 1 exceeds the first dimension of W).
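A minimal serial sketch of why the second call fails (using a plain Array instead of a SharedArray, which behaves identically for bounds checking):

```julia
nx, ny = 3, 2
t = 2.0
W = zeros(nx, ny)

# f captures the globals ny, t, and W automatically; only nx is a parameter.
function f(n)
    for i = 1:n, j = 1:ny
        W[i, j] = W[i, j] + t
    end
end

f(nx)            # runs normally: all indices are in bounds
err = try
    f(nx + 1)    # i eventually reaches nx + 1, past size(W, 1)
    nothing
catch e
    e            # capture the thrown BoundsError
end
```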

  • Back to the original code.
    @distributed was already introduced. Julia's nested for loops can be abbreviated in the form for i=1:m, j=1:n, k=1:p, but @distributed only recognizes the outermost loop, so the outermost and inner layers must be separated and written as:
@distributed for i = 1:m
    for j = 1:n, k = 1:p
        <Expr>
    end
end

Since Julia stores arrays column by column (column-major order), traversing the first index i is significantly faster than traversing the other indices. It is therefore best to put index i in the innermost loop and distribute over the other indices, keeping the traversal over i contiguous. The function becomes:

function f()
    @time @sync @distributed  for j=1:ny
        for i=1:nx
            W[i,j] = W[i,j] + t
        end
    end
end

f()

The original code took 0.316919 seconds (185.12 k allocations: 9.060 MiB); after the modification it takes 0.262387 seconds (185.10 k allocations: 9.058 MiB). Memory consumption is almost identical, but the modified version is faster.
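The same column-major effect can be seen even in serial code. This self-contained sketch (timings will vary by machine; function names are mine) compares the two traversal orders:

```julia
A = zeros(1000, 1000)

# Inner loop over the first index: contiguous, cache-friendly access.
function col_order!(A)
    for j in axes(A, 2), i in axes(A, 1)
        A[i, j] += 1.0
    end
end

# Inner loop over the second index: strided access, typically slower.
function row_order!(A)
    for i in axes(A, 1), j in axes(A, 2)
        A[i, j] += 1.0
    end
end

col_order!(A)   # compare with @time after a warm-up run of each
row_order!(A)
```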

  • When designing a parallel program, it is natural to ask whether broadcast variables are needed. The example above shows that the variables nx, ny, and t, which appear in three different positions inside the @distributed construct, do not need to be broadcast: variables that are only read, not modified, by the processes inside @distributed need no broadcasting. But if you want to modify a variable, should it be broadcast? Consider the following example:
function f()
    @time @sync @distributed  for j=1:ny
        for i=1:nx
            @everywhere t += 1
            W[i,j] = W[i,j] + t
        end
    end
end

f()

If you remove the @everywhere, an error is reported, which proves that broadcasting works here. However, you will find that the computation time grows enormously, because @everywhere is a remote-call command and executing it repeatedly is expensive. If you instead move it to the front, like this:

@everywhere t = 2
function f()
    @time @sync @distributed  for j=1:ny
        for i=1:nx
            t += 1
            W[i,j] = W[i,j] + t
        end
    end
end

f()

The system reports an error. So what is the reasonable approach? The answer is to turn t into a parameter, like this:

function f(t)
    @time @sync @distributed  for j=1:ny
        for i=1:nx
            t += 1
            W[i,j] = W[i,j] + t
        end
    end
end

f(t)

You will see that the computation time barely increases, and this approach requires no broadcasting. But!!! When you print W, you will find that the result has changed. Suppose the intent is to increment t once and consistently add it to W. Taking nx=3; ny=2 as an example, the intended W is:

julia> W
3×2 SharedArray{Float64,2}:
 3.0  3.0
 3.0  3.0
 3.0  3.0

If you use the parameter-passing method above, you instead get:

julia> W
3×2 SharedArray{Float64,2}:
 3.0  3.0
 4.0  4.0
 5.0  5.0

That is, the modification of t accumulates along the i dimension. And the @everywhere t+=1 method yields:

julia> W
3×2 SharedArray{Float64,2}:
 4.0  4.0
 6.0  6.0
 8.0  8.0

This is even more exaggerated: the modification of t accumulates along the i direction and again each time along the j direction. Where is the problem? Clearly, the position of the expression is wrong. Change it to the following:

function f2(t)
    @time @sync @distributed  for j=1:ny
        t+=1
        for i=1:nx
            W[i,j] = W[i,j] + t
        end
    end
end

This gives the correct result. The conclusion: the iterations of the outermost loop of a @distributed construct are split up and independent of each other, so variable modifications do not accumulate across them. The inner loops, however, still follow ordinary loop rules, and modifications there do accumulate. Keep this in mind when using @distributed.
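A serial simulation (a sketch, not the distributed code itself) reproduces both behaviors: each iteration of the distributed outer loop effectively starts from its own copy of t, so incrementing t in the inner loop accumulates along i, while incrementing it once at the top of the outer body gives the intended result.

```julia
nx, ny = 3, 2

# Variant 1: t += 1 inside the inner loop (the parameter-passing f above).
W1 = zeros(nx, ny)
for j = 1:ny
    local t = 2.0        # each outer iteration gets its own copy of t
    for i = 1:nx
        t += 1           # accumulates along i
        W1[i, j] += t
    end
end

# Variant 2: t += 1 once per outer iteration (f2 above).
W2 = zeros(nx, ny)
for j = 1:ny
    local t = 2.0
    t += 1               # incremented once, then reused for the whole column
    for i = 1:nx
        W2[i, j] += t
    end
end
```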

  • @sync ensures that all processes have finished their tasks before execution continues to the next statement (here, the end of @time). Without @sync, @time quickly returns a result such as 0.012252 seconds (8.00 k allocations: 421.867 KiB) while the processes in the task manager are still computing; what @time measured was only the cost of launching the remote calls. Making nx and ny larger makes this more obvious. If you print W while some processes have not yet finished, you will see that some of its elements are still unchanged. Since the shared array is stored on the main process by default, the main process generally finishes first, and the remaining processes finish almost simultaneously (screenshot omitted).
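The guarantee @sync provides can be sketched on a single process using tasks (an analogy only: @distributed uses worker processes, but the @sync semantics are the same — wait for everything launched inside the block):

```julia
results = zeros(4)

# Launch four asynchronous tasks; @sync blocks until all of them finish.
@sync for k in 1:4
    @async begin
        sleep(0.01)       # pretend to be a slow worker
        results[k] = k
    end
end

# Without the @sync, this line could run while results is still all zeros.
```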
rmprocs(2,3,4)

Finally, the redundant processes must be closed like this; otherwise, the next time the code runs, the system adds new processes on top of the existing ones, doubling the memory footprint, and after a few runs it becomes overwhelming (screenshot of the processes after shutdown omitted). But remember never to try to close the main process, or the system will refuse to execute the rmprocs() command.
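A slightly more robust cleanup (an alternative sketch, not from the original post) is to pass workers() instead of hard-coded ids; this removes every remote worker and can never touch the main process:

```julia
using Distributed

addprocs(2)              # start two remote workers
@assert nprocs() == 3    # main process + 2 workers

rmprocs(workers())       # removes only worker processes, never pid 1
```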

Origin: blog.csdn.net/iamzhtr/article/details/93380146