Import a "Tai Chi" library to speed up Python code by 100 times!

We all know that Python's simplicity and readability come at the expense of performance,

especially in computationally intensive cases such as nested for loops.

But now Hu Yuanming, the creator of Taichi, says:

Just import a library called "Taichi" and you can speed up your code by 100 times!

Don't believe it?

Let's look at three examples.

Counting prime numbers, 120x speedup

The first example is very simple: count all prime numbers less than a given positive integer N.

The standard answer is as follows:

[screenshot: the plain Python implementation]
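Since the screenshot isn't reproduced here, below is a minimal sketch of the usual trial-division approach (the exact names and layout in the original screenshot are assumptions):

def is_prime(n: int) -> bool:
    # check divisibility by every integer from 2 up to sqrt(n)
    result = True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result

def count_primes(n: int) -> int:
    # count how many primes lie in [2, n)
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count

print(count_primes(1000000))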

We save the above code and run it.

When N is 1 million, it takes 2.235s to get the result:

[screenshot: timing output]

Now, let's do the magic.

Without changing any function body, we import the taichi library and add two decorators:

[screenshot: the Taichi-accelerated version]
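Sketching what that screenshot likely shows: the function bodies stay exactly as above, and we only add ti.init() plus the @ti.func and @ti.kernel decorators (a sketch, not necessarily the author's exact code):

import taichi as ti

ti.init(arch=ti.cpu)

@ti.func
def is_prime(n: int):
    result = True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result

@ti.kernel
def count_primes(n: int) -> int:
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count

print(count_primes(1000000))

Taichi automatically parallelizes the outermost for loop inside a kernel and turns count += 1 into an atomic add, which is why the loop can safely run on many cores at once.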

The same result now takes only 0.363s, about 6 times faster.

[screenshot: timing output]

If N = 10 million, the Taichi version takes only 0.8s; the plain Python version takes 55s, so that's roughly a 70x speedup!

Not only that, we can also pass an argument to ti.init(), changing it to ti.init(arch=ti.gpu), to let Taichi run the computation on the GPU.

With the GPU backend, counting all primes below 10 million takes only 0.45s, about 120 times faster than the original Python code!

Dynamic programming, 500x speedup

Dynamic programming is an optimization technique that saves computation time by storing intermediate results.

Let's take the longest common subsequence (LCS) problem, a classic dynamic programming example from the textbook "Introduction to Algorithms".

For example, for sequence a = [0, 1, 0, 2, 4, 3, 1, 2, 1] and sequence b = [4, 0, 1, 4, 5, 3, 1, 2], their LCS is:

LCS(a, b) = [0, 1, 4, 3, 1, 2].

The dynamic programming approach to LCS first solves for the length of the longest common subsequence of the first i elements of a and the first j elements of b, then gradually increases i or j and repeats the process until both full sequences are covered.

We use f[i, j] to denote the length of this subsequence, that is, the length of LCS(prefix(a, i), prefix(b, j)), where prefix(a, i) denotes the first i elements of sequence a, namely a[0], a[1], ..., a[i - 1]. This gives the following recurrence relation:

f[i, j] = max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
              f[i - 1, j],
              f[i, j - 1])

with the base case f[i, 0] = f[0, j] = 0, where (a[i - 1] == b[j - 1]) counts as 1 when true and 0 otherwise.

Now, we use Taichi to speed up:

[screenshot: the Taichi implementation]
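The implementation is only shown as a screenshot in the original, so here is a sketch of how this recurrence can be written as a Taichi kernel (the sequence length N and the random test data below are assumptions for illustration):

import taichi as ti
import numpy as np

ti.init(arch=ti.cpu)

N = 5000
f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))  # the DP table

a_np = np.random.randint(0, 100, N, dtype=np.int32)
b_np = np.random.randint(0, 100, N, dtype=np.int32)

@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
    len_a = a.shape[0]
    len_b = b.shape[0]
    ti.loop_config(serialize=True)  # the DP table must be filled in order
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            match = 1 if a[i - 1] == b[j - 1] else 0
            f[i, j] = ti.max(f[i - 1, j - 1] + match,
                             ti.max(f[i - 1, j], f[i, j - 1]))
    return f[len_a, len_b]

print(compute_lcs(a_np, b_np))

Because the recurrence is inherently sequential, the speedup here comes from Taichi compiling the loop body to native code rather than from parallelization.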

The result is as follows:

[screenshot: timing results]

Reaction-Diffusion Equation

In nature, some animals have patterns that look disordered but are not completely random.

[image: animal coat patterns]

Alan Turing, the inventor of the Turing machine, was the first to come up with a model to describe this phenomenon.

In this model, two chemicals, U and V, are used to simulate pattern generation. Their relationship is similar to that of prey and predator; they move and interact on their own:

1. Initially, U and V are randomly distributed over a domain;

2. At each time step, they gradually spread to adjacent spaces;

3. When U and V meet, part of U is consumed by V, so the concentration of V increases;

4. To avoid U being eradicated by V, we add a certain percentage (f) of U and remove a certain percentage (k) of V at each time step.

The process above is summarized by the "reaction-diffusion equations":

∂U/∂t = Du * ∇²U - U * V² + f * (1 - U)
∂V/∂t = Dv * ∇²V + U * V² - (f + k) * V

where ∇² denotes the Laplacian (the diffusion term).

There are four key parameters: Du (the diffusion rate of U), Dv (the diffusion rate of V), f (short for feed, which controls the addition of U), and k (short for kill, which controls the removal of V).

To implement this equation in Taichi, we first create a grid to represent the domain and use a vec2 to store the U and V concentration values in each cell.

Computing the Laplacian requires access to neighboring cells. To avoid reading and updating the same data in one loop, we create two W×H grids, i.e. a field of shape 2×W×H.

Each step reads data from one grid and writes the updated data to the other, and then the two grids swap roles. The data structure design looks like this:

[screenshot: the data structure declaration]
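A sketch of such a declaration (the grid size and the constants below are assumptions, using typical values for this kind of Gray-Scott simulation):

import taichi as ti
import taichi.math as tm

ti.init(arch=ti.gpu)  # falls back to the CPU if no suitable GPU is found

W, H = 800, 480
# two ping-pong buffers, each W x H, each cell holding (U, V) as a vec2
uv = ti.Vector.field(2, dtype=ti.f32, shape=(2, W, H))

Du, Dv = 0.160, 0.080      # diffusion rates of U and V
feed, kill = 0.060, 0.062  # the f and k parameters above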

Initially, we set the concentration of U in the grid to 1 and place V at 50 randomly chosen locations:

[screenshot: the initialization code]
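Again as a sketch (the exact seeding in the original is not shown; here the initial state is built in NumPy and copied into the field):

import numpy as np

def init_uv():
    buf = np.zeros((2, W, H, 2), dtype=np.float32)
    buf[0, :, :, 0] = 1.0            # U = 1 everywhere
    xs = np.random.randint(0, W, 50)
    ys = np.random.randint(0, H, 50)
    buf[0, xs, ys, 1] = 1.0          # seed V at 50 random cells
    uv.from_numpy(buf)

init_uv()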

Then the actual computation can be done in less than 10 lines of code:

@ti.kernel
def compute(phase: int):
    for i, j in ti.ndrange(W, H):
        cen = uv[phase, i, j]
        # five-point Laplacian of the (U, V) pair at cell (i, j)
        lapl = uv[phase, i + 1, j] + uv[phase, i, j + 1] + uv[phase, i - 1, j] + uv[phase, i, j - 1] - 4.0 * cen
        # reaction terms: U is consumed by V and fed at rate feed; V is removed at rate feed + kill
        du = Du * lapl[0] - cen[0] * cen[1] * cen[1] + feed * (1 - cen[0])
        dv = Dv * lapl[1] + cen[0] * cen[1] * cen[1] - (feed + kill) * cen[1]
        val = cen + 0.5 * tm.vec2(du, dv)
        uv[1 - phase, i, j] = val  # write the update into the other buffer

Here we use an integer phase (0 or 1) to select which grid we read from; 1 - phase is the grid we write to.

The last step is to color the result according to the concentration of V, which produces a strikingly organic pattern:
[image: the resulting pattern]
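To actually watch the pattern evolve, the kernel is driven in a ping-pong loop and the V concentration is mapped to pixel colors. A minimal grayscale sketch (the original uses a proper colormap; the window size and substep count here are assumptions):

pixels = ti.Vector.field(3, dtype=ti.f32, shape=(W, H))

@ti.kernel
def render(phase: int):
    for i, j in pixels:
        v = uv[phase, i, j][1]
        pixels[i, j] = tm.vec3(1.0 - v)  # darker where V is concentrated

gui = ti.GUI("reaction-diffusion", res=(W, H))
step = 0
while gui.running:
    for _ in range(30):       # several simulation substeps per displayed frame
        compute(step % 2)     # read buffer step % 2, write buffer 1 - step % 2
        step += 1
    render(step % 2)          # step % 2 is the buffer that was written last
    gui.set_image(pixels)
    gui.show()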

Interestingly, Hu Yuanming notes that even though the initial locations of V are chosen randomly, similar patterns emerge every time.

Compared with a Numba implementation that reaches only about 30 fps, the Taichi version easily exceeds 300 fps, because it can use the GPU as its backend.

Install with a single pip command

In fact, Taichi is a DSL (domain-specific language) embedded in Python. It uses its own compiler to compile functions decorated with @ti.kernel to various hardware backends, including CPUs and GPUs, for high-performance computation.

With it, you no longer need to envy the performance of C++/CUDA.

As the name suggests, Taichi comes from Hu Yuanming's team at Taichi Graphics. A single pip install is all it takes, and the library works together with other Python libraries, including NumPy, Matplotlib, and PyTorch.
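For reference, the package on PyPI is simply called taichi:

pip install taichi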

So how does Taichi differ from these libraries and from other acceleration approaches? Hu Yuanming gives a detailed comparison of the pros and cons; interested readers can follow the link below for details:

https://docs.taichi-lang.org/blog/accelerate-python-code-100x


Origin blog.csdn.net/weixin_52051554/article/details/130301339