ASC 2021 Supercomputing Competition: QuEST configuration and hands-on practice (quantum computing)

 

This is for the 2021 ASC supercomputing competition. I was busy in the early stage, so...

The submission deadline is January 8th, and I started working on the morning of January 7th

(It can't be done!!!)

Let's see whether it can be finished in one go within these two days. This post records the work and the steps taken over the two days.

 

Scrolling WeChat in the morning: winter vacation is coming, not bad

Scrolling Alipay for a while: my funds fell yesterday, crying

Look in the mirror: I'm bald again, which is a good sign of becoming stronger

 

Start working:

 

1. Tasks:

1) Download the QuEST source code

Download the source code from https://github.com/QuEST-Kit/QuEST/releases

2) Download the two workloads

https://pan.baidu.com/s/1t7miv2h2MyEmY191y6lXOA Password: fhd4

3) For QuEST itself, refer to the official documentation ( https://quest.qtechtheory.org/docs/ )

Run and inspect workload 1 (the random circuit)

1. Copy mytimer.hpp to /QuEST/include

2. Copy random.c to /examples and rename it to tutorial_example.c

3. Return to the QuEST root directory, then build and run:

mkdir build

cd build

cmake ..

make -j4

./demo

It successfully threw an error

As you can see, the result files to be submitted were produced

But an error was still reported, probably because the virtual machine has too little memory

In the end, I ran it on a classmate's computer with a powerful CPU and plenty of memory. The sample ran to completion and produced the log files.

2. What needs to be submitted:

1) The random circuit results, compressed into PRESTO_Dedispersion.tar.gz

1. Probs.dat

2. stateVector.dat

3. Command line file (*.sh)

4. Screen output (*.log)

2) The GHZ_QFT results, compressed into GHZ_QFT.tar.gz

1. Probs.dat

2. stateVector.dat

3. Command line file (*.sh)

4. Screen output (*.log)

So we created two folders, each containing the files listed above.

3. Completion status:

1. A brief introduction to QuEST

Before optimizing QuEST, we introduce the functions and source code of the QuEST simulator. QuEST is primarily developed by Anna Brown of the e-Research Centre, and Tyson Jones of the QTechTheory group, at the University of Oxford, with help from Simon Benjamin, Mihai Duta and Ian Bush. QuEST is driven by the needs of Oxford's QTechTheory group to precisely and efficiently simulate deep quantum circuits of many qubits. To date, QuEST has been used to simulate up to 38 qubits across 2048 computers, combining the processing power of ~49k CPUs.

A qubit lives in a two-dimensional Hilbert space, the abstract space in which its state vector is described. The space is two-dimensional because physical carriers such as the polarization of a photon or the spin of an electron have only two orthogonal orientations, which behave like two mutually perpendicular coordinate axes. QuEST simulates quantum operations by simulating the joint state of multiple qubits.

Figure: Qubit concrete description
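In standard notation (added here as a reminder, it is not part of the original figure), the state of a single qubit and the state vector of n qubits can be written as follows. The symbols α, β and a_k are just the conventional names for the complex amplitudes.

```latex
|\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle,
\qquad \alpha, \beta \in \mathbb{C},
\qquad |\alpha|^2 + |\beta|^2 = 1

|\Psi_n\rangle = \sum_{k=0}^{2^n - 1} a_k\,|k\rangle,
\qquad \sum_{k=0}^{2^n - 1} |a_k|^2 = 1
```

The second line is why the cost grows so quickly: an n-qubit register needs 2^n complex amplitudes.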

QuEST can maintain both pure and mixed states, operated upon by single-qubit, controlled and multi-controlled gates, and by non-unitary actions such as measurement and decoherence. Gates can be specified by their common names (e.g. hadamard), as rotations on the Bloch sphere around an arbitrary axis, or as arbitrary complex matrices including a global phase factor. Since QuEST is low-level, there is no penalty for accessing the underlying wavefunction and performing otherwise expensive analysis.
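To make "gates acting on a state vector" concrete, here is a minimal plain-C sketch that does not use QuEST at all. The names `StateVec`, `sv_init_zero`, `sv_hadamard`, `sv_cnot` and `sv_prob` are hypothetical helpers invented for this illustration; QuEST's real kernels are far more elaborate (threads, MPI distribution, density matrices).

```c
#include <assert.h>
#include <stddef.h>

#define NUM_QUBITS 3
#define NUM_AMPS (1u << NUM_QUBITS)      /* 2^n complex amplitudes */
#define INV_SQRT2 0.70710678118654752440

/* Hypothetical toy state vector: separate real and imaginary parts. */
typedef struct {
    double re[NUM_AMPS];
    double im[NUM_AMPS];
} StateVec;

/* Set the register to |00...0>. */
void sv_init_zero(StateVec *s) {
    for (size_t k = 0; k < NUM_AMPS; k++) { s->re[k] = 0.0; s->im[k] = 0.0; }
    s->re[0] = 1.0;
}

/* Hadamard on qubit `target`: mixes each pair of amplitudes whose
   indices differ only in bit `target`. */
void sv_hadamard(StateVec *s, unsigned target) {
    assert(target < NUM_QUBITS);
    size_t bit = (size_t)1 << target;
    for (size_t k = 0; k < NUM_AMPS; k++) {
        if (k & bit) continue;               /* visit each pair once */
        size_t k1 = k | bit;
        double r0 = s->re[k],  i0 = s->im[k];
        double r1 = s->re[k1], i1 = s->im[k1];
        s->re[k]  = INV_SQRT2 * (r0 + r1);
        s->im[k]  = INV_SQRT2 * (i0 + i1);
        s->re[k1] = INV_SQRT2 * (r0 - r1);
        s->im[k1] = INV_SQRT2 * (i0 - i1);
    }
}

/* CNOT: where the control bit is 1, swap the target=0 and target=1 amplitudes. */
void sv_cnot(StateVec *s, unsigned control, unsigned target) {
    assert(control < NUM_QUBITS && target < NUM_QUBITS && control != target);
    size_t cbit = (size_t)1 << control, tbit = (size_t)1 << target;
    for (size_t k = 0; k < NUM_AMPS; k++) {
        if ((k & cbit) && !(k & tbit)) {
            size_t k1 = k | tbit;
            double r = s->re[k], i = s->im[k];
            s->re[k]  = s->re[k1]; s->im[k]  = s->im[k1];
            s->re[k1] = r;         s->im[k1] = i;
        }
    }
}

/* Probability of measuring basis state k. */
double sv_prob(const StateVec *s, size_t k) {
    assert(k < NUM_AMPS);
    return s->re[k] * s->re[k] + s->im[k] * s->im[k];
}
```

Calling `sv_init_zero`, then `sv_hadamard` on qubit 0, then `sv_cnot(0, 1)` and `sv_cnot(1, 2)` prepares a 3-qubit GHZ state, where only the all-zeros and all-ones basis states have non-zero probability.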

QuEST allows you to simulate as many qubits as you can fit in memory. The state vector for n qubits, when using a decimal precision with b bytes per number (e.g. 8 for a double), requires b · 2^(n - 29) GB. Using double precision, and allowing for a generous memory overhead, QuEST can simulate 26 qubits on a machine with 2 GB of memory, 29 qubits on a 16 GB laptop, or 45 qubits distributed across 4096 supercomputer nodes with 256 GB each.
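The memory formula can be checked with a few lines of arithmetic. The sketch below assumes the usual layout of 2^n complex amplitudes, each stored as two real numbers of b bytes, so the total is b · 2^(n+1) bytes, that is b · 2^(n - 29) GiB (binary gigabytes). The function name `state_vector_gib` is hypothetical.

```c
#include <assert.h>

/* GiB needed for the state vector of `num_qubits` qubits when each real
   number takes `bytes_per_real` bytes: b * 2^(n - 29). */
double state_vector_gib(int num_qubits, int bytes_per_real) {
    assert(num_qubits >= 0 && num_qubits < 62 && bytes_per_real > 0);
    double gib = (double)bytes_per_real;
    /* multiply or divide by two |n - 29| times (exact in floating point) */
    for (int k = 29; k < num_qubits; k++) gib *= 2.0;
    for (int k = num_qubits; k < 29; k++) gib /= 2.0;
    return gib;
}
```

With doubles (b = 8) this gives 1 GiB for 26 qubits, 8 GiB for 29 qubits and 524288 GiB for 45 qubits. Spread over 4096 nodes, the last figure is 128 GiB per node, which fits in 256 GB with room for overhead.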

 

 

2. Hardware environment:

CPU: Intel Xeon Phi 7210, 1.3 GHz, 64 cores

Memory: 16 GB × 4, DDR4 2133 MHz

Hard disk: 120 GB SSD × 1

With 16 GB HBM on the CPU, in flat mode

Software environment:

OS: GNU/Linux CentOS 7.2

Compiler: Intel Composer XE Suites 2017.1.043

Path: /opt/intel/parallel_studio_xe_2017.1.043

MKL: Intel MKL 2017.1.043

Path: /opt/intel/compilers_and_libraries_2017/linux/mkl/

MPI: Intel MPI 2017 Update 1 Build 20161016

Path: /opt/intel/impi/2017.1.132/

PBS: Torque

QuEST: 3.2.0

We made many attempts and changes while configuring the environment, and the full process is complex and repetitive. Therefore, the figures and other information shown in the rest of this article are the results of the final run.

3. Configure the basic environment and compile

Use PuTTY to connect to node mu01, which serves as the base for the subsequent environment configuration and compilation

Figure: Initial environment variable configuration of node mu01

1. Transfer the downloaded QuEST files, the random workload, and the GHZ_QFT workload to node mu01.

2. Copy mytimer.hpp to /QuEST/include

3. Copy random.c to /examples and rename it to tutorial_example.c

4. Return to the QuEST root directory, then build:

mkdir build

cd build

cmake ..

Figure: CMake output log

 

make -j4

Figure: make compilation and log output

 

After compilation finishes, an ELF executable named demo is generated. We run demo to observe the program's running time and other log output.

Taking the random circuit as an example, the initial (unoptimized) log output of the program is as follows:

 

./demo

Figure: Unoptimized run of the random circuit

 

Taking GHZ_QFT as an example, the initial (unoptimized) log output of the program is as follows:

./demo

Figure: Unoptimized run of GHZ_QFT

 

4. Multithreaded optimization with OpenMP

For OpenMP to parallelize a loop, the loop must meet the following conditions:

1. The induction variable of the loop (i.e. i) must be an integer. OpenMP 2.x required a signed integer; since OpenMP 3.0, unsigned integers and pointers are also accepted.

2. The comparison in the loop condition must be one of <, <=, >, >=

3. The increment part of the loop must add or subtract a loop-invariant value (that is, the step is the same on every iteration).

4. If the comparison operator is < or <=, then i must increase on every iteration; otherwise it must decrease

5. The loop body must be single-entry and single-exit, that is, control cannot jump from inside the loop to outside it. goto and break may only transfer control within the loop body, and exceptions must be caught inside the loop.
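A minimal sketch of a loop that satisfies all five conditions above. The function `scale_amplitudes` is a hypothetical example, not taken from the QuEST source. It is meant to be compiled with an OpenMP flag (for example `-fopenmp` with GCC or `-qopenmp` with the Intel compiler); without the flag the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply every element of `amps` by `factor`. */
void scale_amplitudes(double *amps, long long num_amps, double factor) {
    assert(amps != NULL && num_amps >= 0);
    /* 1. signed integer induction variable i
       2. comparison is `<`
       3. constant step of +1
       4. `<` together with an increasing i
       5. no jump out of the loop body */
#pragma omp parallel for
    for (long long i = 0; i < num_amps; i++) {
        amps[i] *= factor;
    }
}
```

Each iteration touches a different element, so no synchronization is needed inside the loop.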

We scanned the QuEST framework with an OpenMP scanner and found no functions that conflict with OpenMP, so OpenMP can be used for optimization.

Figure: OpenMP configuration example

Load balancing is the factor with the greatest impact on performance in multi-threaded programs. Only with good load balancing can all cores stay busy with no idle time. Without it, some threads finish much earlier than others, leaving processors idle and wasting the potential of the optimization. We try to use multiple threads in every compute-heavy region.

Figure: OpenMP multi-threaded optimization

In a loop, the time taken by each iteration can differ widely, which destroys load balance. You can usually inspect the source code to see how much the iteration time may vary. In most cases, every iteration takes approximately the same time. When this is not the case, you may still be able to find subsets of iterations that take approximately the same time. For example, sometimes all even-numbered iterations together take as long as all odd-numbered iterations, and sometimes the first half of the loop takes about as long as the second half. On the other hand, sometimes no such subsets can be found. We pass this information to OpenMP so that it has a better chance of distributing the loop well across threads.
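In OpenMP, this kind of hint is given through the `schedule` clause. The sketch below uses a deliberately uneven, made-up workload (iteration i performs i + 1 additions) and asks for dynamic scheduling. The function name `triangular_sum` and the chunk size of 16 are illustrative assumptions, not values from the competition code.

```c
#include <assert.h>

/* Hypothetical uneven workload: iteration i performs i + 1 additions,
   so late iterations are much more expensive than early ones. */
long long triangular_sum(int n) {
    assert(n >= 0);
    long long total = 0;
    /* schedule(dynamic, 16): an idle thread fetches the next chunk of 16
       iterations, which evens out the load across threads.
       reduction(+:total): each thread sums privately, avoiding a data race. */
#pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        long long local = 0;
        for (int j = 0; j <= i; j++) {
            local += 1;
        }
        total += local;
    }
    return total;
}
```

When every iteration costs about the same, `schedule(static)` is usually the better choice because it has the lowest scheduling overhead and keeps each thread on a contiguous range of memory.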

Figure: Multi-thread optimization of OpenMP concatenation

By default, OpenMP assumes that all loop iterations take the same time. It therefore splits the iterations evenly across cores in contiguous blocks, which minimizes memory access conflicts: loops generally access memory linearly, so giving the first half of the loop to one core and the second half to another keeps conflicts to a minimum. However, what is best for memory access may not be best for load balancing, and conversely the best load balancing may hurt memory access. A multi-qubit simulation can be used to show the effect: with OpenMP, the threads work in parallel and the running time goes down.
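The contiguous split described above can be sketched by hand. The helper `static_block` below is hypothetical and only illustrates a typical near-equal block partition; the exact partition used for a static schedule is up to the OpenMP implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: contiguous range [*begin, *end) handled by thread
   `tid` when n iterations are split into near-equal blocks over `nthreads`. */
void static_block(long long n, int nthreads, int tid,
                  long long *begin, long long *end) {
    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    assert(begin != NULL && end != NULL);
    long long base  = n / nthreads;   /* minimum block size */
    long long extra = n % nthreads;   /* the first `extra` threads get one more */
    *begin = tid * base + (tid < extra ? tid : extra);
    *end   = *begin + base + (tid < extra ? 1 : 0);
}
```

With 8 iterations and 2 threads, thread 0 gets iterations 0 to 3 and thread 1 gets iterations 4 to 7, which is the "first half, second half" split mentioned in the text.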

 

 

5. Code-level optimization based on induction variable elimination and copy propagation

Induction variable elimination:

Input: a loop L with reaching-definition information and the computed families of induction variables

Output: the modified loop

Method: For each basic induction variable i, and for each induction variable j in the family of i with triple (i, c, d),

perform the following steps:

1. Create a new temporary variable t. If two variables j1 and j2 have the same triple, create only one new variable for both.

2. Replace the assignment to j with j = t.

3. Immediately after each definition i = i + n in L, add t = t + c*n. Put t into the family of i, with triple (i, c, d).

4. At the end of the preheader, add the statements t = c*i and t = t + d, so that t = c*i + d = j at the start of the loop.

Figure: Example of induction variable elimination
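The four steps above can be applied by hand to a small loop. The sketch below is a made-up example with c = 4, d = 3 and a step of n = 1; the function names are hypothetical. Compiler textbooks usually describe this particular rewrite as strength reduction on induction variables, because the multiplication inside the loop is replaced by an addition.

```c
#include <assert.h>

/* Before: j = 4*i + 3 is recomputed with a multiplication on every iteration.
   j belongs to the family of i with triple (i, 4, 3). */
long long sum_before(int n) {
    assert(n >= 0);
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        long long j = 4LL * i + 3;
        sum += j;
    }
    return sum;
}

/* After: a temporary t tracks 4*i + 3 using only additions. */
long long sum_after(int n) {
    assert(n >= 0);
    long long sum = 0;
    int i = 0;
    long long t = 4LL * i;   /* step 4: t = c*i at the end of the preheader */
    t = t + 3;               /* step 4: t = t + d */
    while (i < n) {
        long long j = t;     /* step 2: the assignment to j becomes j = t */
        sum += j;
        i = i + 1;           /* the definition i = i + n, here with n = 1 */
        t = t + 4 * 1;       /* step 3: t = t + c*n right after it */
    }
    return sum;
}
```

Both functions return the same value for every n; for n = 5 the values of j are 3, 7, 11, 15, 19.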

Copy propagation:

The purpose of copy propagation is to replace, wherever possible, uses of the left-hand side of a copy statement with its right-hand side. Although this does not reduce the number of instructions on the surface, the left-hand side may no longer be live after the optimization, so the copy becomes dead code that can be eliminated.

Copy propagation is implemented much like common subexpression elimination: in place of available expressions, it considers available copy statements. A copy statement generates an available copy, and a definition of a virtual register kills all available copies that involve that register (whether it appears on the left-hand side or the right-hand side). The data flow direction, the meet operation (intersection) and the initial values are the same as in the data flow analysis for available expressions.

After computing the available copy statements at each statement, examine each virtual register a used by a statement. If there is an available copy whose left-hand side is a, then a can be replaced by its right-hand side. This can be applied recursively: let that right-hand side be b; if there is an available copy whose left-hand side is b, then a can be further replaced by that copy's right-hand side c. In this way, several levels of copies can be propagated in one step.

Figure: Copy propagation example
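A tiny before-and-after sketch of copy propagation followed by dead-code elimination. The functions are made up for illustration; a real compiler performs this on its intermediate representation rather than on C source.

```c
#include <assert.h>

/* Before: b and c are plain copies. */
int square_before(int a) {
    int b = a;        /* copy statement b = a */
    int c = b;        /* copy statement c = b */
    return c * c;     /* uses c */
}

/* After: the uses of c are replaced by b and then by a in one step.
   b and c are no longer live, so their copies are removed as dead code. */
int square_after(int a) {
    return a * a;
}

/* Quick equivalence check for one input. */
void check_equivalent(int a) {
    assert(square_before(a) == square_after(a));
}
```

The instruction count of the expression itself does not change; the gain comes from the two copies that disappear.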

6. Results after optimization

We optimized the QuEST framework with OpenMP and the other techniques described above, which in theory reduces the computation time. In practice, the running time of the random circuit came down to 319.712564 seconds, and the running time of GHZ_QFT came down to 232.605283 seconds.

Figure: Optimized run of the random circuit

Figure: Optimized run of GHZ_QFT

To sum up:

1) I searched on Xianyu ("salted fish") and found no relevant material; maybe my Google hacking skills are not good enough

2) In the end, I looked up the supercomputing competition papers of previous seniors and completed the write-up following their approach.

3) After two days of writing and testing, the result is okay

 


Origin blog.csdn.net/qq_42882717/article/details/112304216