LuaJIT official performance optimization guide, with notes

LuaJIT is one of the fastest scripting language implementations available today, but once you use it in depth you quickly find that getting the advertised performance is not as easy as it sounds. In practice, small freshly written test cases often perform very well, typically finishing within a millisecond, yet once code complexity grows, runs of tens or even hundreds of milliseconds appear and performance becomes very erratic.

Because of this, many people have asked about it on the LuaJIT mailing list. A fairly complete answer by the author, Mike Pall, was put on the official wiki:

http://wiki.luajit.org/Numerical-Computing-Performance-Guide

However, the original text mostly tells you what to do and hardly explains why.

So this article is not a simple translation of the official optimization guide. Its main purpose is to explain some of the principles behind LuaJIT: because the original only tells you what to do, it remains vague which optimizations matter, how much impact they have, and why. Understanding the reasons behind them often helps a great deal.

Also note that plain Lua, LuaJIT's JIT mode (available on PC and Android), and LuaJIT's interpreter mode (the only mode that can run on iOS) implement the language in very different ways, so some Lua optimization tricks are not necessarily universal. This article focuses on LuaJIT's JIT mode.

1.Reduce number of unbiased/unpredictable branches.

Reduce unpredictable branch code

A branch is code that jumps depending on a condition (the most typical being if..else). So what counts as an unpredictable branch? Simply put:

if condition1 then
  ...
elseif condition2 then
  ...
end

If condition1 or condition2 holds with a very high probability (> 95%), we consider this a predictable branch.

This is the optimization point that Mike Pall puts first (and it really should come first). The reason lies in the nature of LuaJIT's trace compiler: in order to generate machine code that is as efficient as possible, it makes assumptions based on how the code actually runs. Take the example above: if LuaJIT finds that condition2 is true with very high probability, it will generate the fastest possible code for the condition2 path.

You may ask: does LuaJIT really know something about how the program runs?

Yes, it does.

This is precisely what a trace compiler does: it first runs the bytecode, profiles the hot code and learns what can be optimized, and only then compiles it into the most efficient machine code. This is how LuaJIT currently works.

Why does it have to work this way? Here is a more intuitive example. Lua is a dynamically typed language: when facing a + b, the compiler does not know what types a and b are. If a + b is just the addition of two numbers, compiling machine code that sums them directly is naturally fast. But if that cannot be confirmed, the code has to assume any type is possible, dynamically check the types first (two tables? two numbers? something else?), and then jump to the corresponding handling according to the type. Obviously that is tens of times slower than simply adding two numbers.

Therefore, in pursuit of maximum performance, LuaJIT makes bold assumptions: if it observes that a + b is the addition of two numbers, it compiles machine code for a numeric addition.

But what if, at some moment, a + b stops being a numeric addition and becomes the addition of two tables? Wouldn't the compiled machine code then be wrong? To prevent this, every time LuaJIT makes an assumption it also emits a piece of guard code that checks whether the assumption still holds. If it does not, execution exits the compiled trace, and LuaJIT decides, depending on the situation, whether to compile another piece of machine code to cover the new case.

This is why your branches must be predictable: if the code frequently violates LuaJIT's assumptions, execution will frequently bail out of the compiled machine code, and after several failed assumptions the trace may be abandoned altogether. LuaJIT is therefore extremely sensitive to branches.

This is LuaJIT's first performance pitfall. The author suggests using math.min/max or the bit operations library (bitop) to bypass if..else style branches where possible. In practice things are often more complicated, and every place involving a conditional jump is a potential performance pitfall.
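As a hedged illustration of the math.min/max suggestion (the clamp functions below are made up for this example, not code from the guide), compare a branchy clamp with a branch-free one:

local function clamp_branchy(x, lo, hi)
  -- two data-dependent branches the trace compiler must guess about
  if x < lo then
    return lo
  elseif x > hi then
    return hi
  else
    return x
  end
end

local function clamp_branchless(x, lo, hi)
  -- same result, no branches: min/max compile to straight-line code
  return math.min(math.max(x, lo), hi)
end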

In addition, in interpreter mode (the iOS case), LuaJIT falls back to honest dynamic type checks at every step, so it is not sensitive to branch prediction and this optimization does not deserve much attention there.

2.Use FFI data structures.

If you can, implement your data structures with the FFI instead of Lua tables.

LuaJIT's FFI is a feature many people overlook, or use merely as a nicer C binding library, but it is actually a super tool for performance.

For example, we implemented a Unity-style Vector3 with a Lua table and with the FFI respectively and measured both: the memory usage ratio was about 10:1, and the time to compute x + y + z about 8:1. The optimization effect is striking.

The code is as follows:

local ffi = require("ffi")

ffi.cdef[[
typedef struct { float x, y, z; } vector3c;
]]

local count = 100000

local function test1()  -- Lua table version
  local vecs = {}
  for i = 1, count do
    vecs[i] = {x = 1, y = 2, z = 3}
  end
  local total = 0
  -- record the time and memory usage of the following loop after a gc (omitted here)
  for i = 1, count do
    total = total + vecs[i].x + vecs[i].y + vecs[i].z
  end
  return total
end

local function test2()  -- FFI version
  -- allocate count + 1 elements so we can index from 1 like the table version
  -- (C arrays are 0-based)
  local vecs = ffi.new("vector3c[?]", count + 1)
  for i = 1, count do
    vecs[i].x, vecs[i].y, vecs[i].z = 1, 2, 3
  end
  local total = 0
  -- record the time and memory usage of the following loop after a gc (omitted here)
  for i = 1, count do
    total = total + vecs[i].x + vecs[i].y + vecs[i].z
  end
  return total
end

Why is the gap so large? Because a Lua table is essentially a hash table: accessing fields in a hash table is slow, and a lot of extra bookkeeping has to be stored alongside the data. With the FFI, a Vector3 only needs the space for three floats (x, y, z), so the memory footprint is naturally much lower; and when accessing x, y or z, the JIT uses the FFI type information to read the memory directly instead of hashing the key as a table lookup does, so performance is much higher.

Unfortunately, the FFI is only fast in JIT mode. Mobile games nowadays almost always need to support iOS, and since only interpreter mode can run on iOS, FFI performance there is very poor (even slower than plain tables); only the memory advantage remains. So if you have to target a platform like iOS, this optimization point can mostly be ignored, or applied only to a few pieces of core code on Android.

3.Call C functions only via the FFI.

Whenever possible, call C functions through the FFI.

Similarly, the FFI can be used to call C functions exported with C linkage. Most people seem to think this merely saves the trouble of generating bindings with tools such as tolua, but the bigger benefit of the FFI is the qualitative improvement it brings to the calls themselves.

The reason is that exporting a C function through the FFI requires providing its prototype. With the prototype, LuaJIT knows the exact type of every parameter and of the return value. Anyone familiar with compilers knows that function calls and returns are generally implemented with the stack, and to generate the push/pop code you must know the whole parameter list and the return type. Armed with this information, LuaJIT can generate machine code that makes the call as seamlessly as a C compiler would, without going through the lua_push* style functions to pass parameters the way the standard Lua/C API does.

If you do not call the exported C function through the FFI, LuaJIT lacks information about it and cannot generate JIT code for the call, which naturally lowers performance. Worse, before version 2.1.0 such a call would simply abort JIT compilation: the whole surrounding code could not be compiled, and performance suffered badly.
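A minimal sketch of an FFI-declared C call, using strlen from the C standard library (reachable through ffi.C with no binding code at all):

local ffi = require("ffi")

-- declare the prototype so LuaJIT knows the parameter and return types
ffi.cdef[[
size_t strlen(const char *s);
]]

-- the JIT can compile this call directly, with no Lua/C API stack traffic
print(tonumber(ffi.C.strlen("hello")))   --> 5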

4.Use plain 'for i=start,stop,step do ... end' loops.

When writing loops, prefer the plain form for i = start, stop, step do ... end (or ipairs), and try to avoid for k, v in pairs(x) do ... end.

First, up to and including LuaJIT 2.1.0-beta2, for k, v in pairs(t) do ... end is not JIT-compiled (no machine code is generated for it). This pitfall exists mainly because the assembly for a generic key/value table traversal is hard to write; whatever the reason, the takeaway is that if you want to traverse an array or run a loop efficiently, indexing directly with numbers is the best approach.

Second, this style of loop is also more amenable to loop optimizations such as unrolling (see the next point). A small sketch follows.
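A small sketch, assuming arr is a plain array-style table: the numeric for (or ipairs) version stays on the JIT path, while the pairs version does not.

local arr = {}
for i = 1, 1000 do arr[i] = i end

-- JIT-friendly: numeric for loop over a known length
local total = 0
for i = 1, #arr do
  total = total + arr[i]
end

-- not JIT-compiled (NYI): generic traversal with pairs
local total2 = 0
for _, v in pairs(arr) do
  total2 = total2 + v
end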

5.Find the right balance for unrolling.

Loop unrolling has both pros and cons; you have to find the balance yourself.

In the early C++ days, manually unrolling loops into straight-line code was a common optimization, but compilers later gained loop-unrolling capabilities of their own, so nobody does it by hand anymore. LuaJIT also performs this optimization itself (see its implementation in lj_opt_loop), so it can unroll loops for you.

However, this unrolling is decided at runtime, so it cuts both ways. The author gives an example: in a two-level nested loop, if the inner loop runs fewer than about 10 iterations, LuaJIT will try to unroll it; but because it sits inside a large outer loop, the inner loop may be entered and unrolled again and again, and if the unrolling grows too large, the JIT eventually cancels it.
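A hedged sketch of the shape of code being described (the sizes are illustrative only): a short inner loop inside a large outer loop, which the trace compiler may try to unroll repeatedly.

local v = {1, 2, 3}
local out = {}

for i = 1, 1000000 do      -- large outer loop
  local s = 0
  for j = 1, 3 do          -- short inner loop: a candidate for unrolling
    s = s + v[j]
  end
  out[i] = s
end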

The author has not published any in-depth measurements on this and only offers some fairly intuitive advice (his last sentence is literally "You may have to experiment a bit"). If any readers understand this area better, discussion is welcome.

6.Define and call only 'local' (!) functions within a module.

7.Cache often-used functions from other modules in upvalues.

These two points can be taken together: when calling any function, make sure you reach it through a local, and performance will be better. For example:

local ms = math.sin

function test()
  math.sin(1)
  ms(1)
end

What is the difference between these two calls to math.sin?

In fact, math is a table, so math.sin is itself a table lookup with the key "sin"; that is one cost. And math is a global variable, so there is also a lookup in the globals table (_G["math"]).

Once ms has been cached as a local, the math.sin lookups can be skipped. Moreover, variables from an enclosing scope are stored by Lua in upvalue slots, so finding ms only requires reading that upvalue; the search range is much smaller and therefore faster.

Of course, JIT-compiled code may optimize this process further, but the safer approach is still to cache the function in a local yourself.

In short, if a function is only used within the current file, declare it local; if it is a global or comes from another module, cache it in a local before using it. A typical sketch of the pattern follows.
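A hedged sketch of the usual module-top pattern (rotate and describe are illustrative names): cache functions from other modules in locals once, then use the locals, as upvalues, everywhere below.

local sin, cos = math.sin, math.cos   -- cached once, reused as upvalues below
local format = string.format

local function rotate(x, y, angle)
  local c, s = cos(angle), sin(angle)
  return x * c - y * s, x * s + y * c
end

local function describe(x, y)
  return format("(%.2f, %.2f)", x, y)
end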

8.Avoid inventing your own dispatch mechanisms.

Avoid implementing your own dispatch mechanism; prefer built-in mechanisms such as tables and metatables.

To keep code elegant we often introduce mechanisms such as message dispatch: when a message arrives, the corresponding handler is called according to the enum we defined for it. We used to write it like this:

if opcode == OP_1 then
  ...
elseif opcode == OP_2 then
  ...
end

But under LuaJIT it is more advisable to implement the above as a table (or metatable) of handlers:

local callbacks = {}

callbacks[OP_1] = function() ... end

callbacks[OP_2] = function() ... end

This is because both table lookups and metatable lookups can participate in JIT optimization, whereas a hand-rolled dispatch mechanism usually relies on branches or other more complex code structures, and ends up slower than a plain table lookup plus JIT optimization.
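Usage then reduces to one table lookup plus one call (a sketch; opcode comes from the example above, and msg is a placeholder argument):

local handler = callbacks[opcode]
if handler then
  handler(msg)
end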

9.Do not try to second-guess the JIT compiler.

There is no need to do manual optimizations on the JIT compiler's behalf.

The author gives an example:

z = x[a+b] + y[a+b] performs perfectly well in LuaJIT as written; there is no need to first write local c = a + b and then z = x[c] + y[c].

The latter style is not wrong in itself, but another LuaJIT pitfall is this: to run fast, it keeps local variables in CPU registers as much as possible, which is far faster than repeatedly reading memory (on modern CPUs the gap can be a hundredfold). LuaJIT is not perfect here, though: once there are too many live local variables it may fail to allocate enough registers (this is especially noticeable on ARMv7, where a deep call chain plus a handful of variables is enough to blow the budget), and then the JIT simply gives up compiling. Note also that many local variables may just sit there declared but unused, yet LuaJIT's compiler cannot always determine precisely that such a variable no longer needs to be kept, so it is worth keeping the number of local variables in a function scope under control.

That said, writing code while second-guessing LuaJIT's behavior is genuinely painful. In general it is enough to profile first, then test and optimize the hot code.

10.Be careful with aliasing, esp. when using multiple arrays.

Variable aliasing may prevent the JIT from optimizing common subexpressions, especially when using multiple arrays.

The author gives an example:

x[i] = a[i] + c[i]; y[i] = a[i] + d[i]

We might expect the two occurrences of a[i] to be the same thing, so that the compiler could optimize this into

local t = a[i]; x[i] = t + c[i]; y[i] = t + d[i]

But that is not the case, because x and a might be the same table, in which case x[i] = a[i] + c[i] changes the value of a[i], and y[i] = a[i] + d[i] can no longer reuse the previous value of a[i].

The essential difference from the situation in point 9 is that there z, a and b are all value types, whereas here x and a are reference types, and reference types may refer to the same object (aliasing), so the compiler has to give up this optimization.

11.Reduce the number of live temporary variables.

Reduce the number of live temporary variables

The reason was already given in point 9: too many live temporary variables may exhaust the registers and prevent the JIT compiler from keeping values in registers. Note that the key word is live: if you end a temporary variable's lifetime early, the compiler does take that into account. For example:

function foo()
  do
    local a = "haha"
  end
  print(a)
end

Here print outputs nil, because a's lifetime ends when execution leaves the do ... end block. Scoping like this lets you avoid having too many temporaries alive at the same time.

There is also a very common trap. Suppose we implement a Vector3 type for vectors in 3D space; we often overload some of its metamethods, such as __add:

Vector3.__add = function(va, vb)
  return Vector3.New(va.x + vb.x, va.y + vb.y, va.z + vb.z)
end

Then in our code we write a + b + c to sum up a series of Vector3s.

For LuaJIT this is actually a significant hidden cost: every + creates a new Vector3, producing a large number of temporaries. Even setting aside the GC pressure, merely allocating registers for all these variables can easily go wrong.

So a balance has to be found between performance and convenience here: if each addition instead writes its result into an existing table, the pressure drops a lot, although some convenience and readability may have to be sacrificed. A sketch of the idea follows.
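A hedged sketch of the "write into an existing table" idea (Vector3.AddTo is an invented name, not from the original): the caller supplies the output object, so chained sums allocate nothing.

-- writes va + vb into out instead of allocating a new Vector3
function Vector3.AddTo(out, va, vb)
  out.x = va.x + vb.x
  out.y = va.y + vb.y
  out.z = va.z + vb.z
  return out
end

-- a + b + c without temporaries: reuse one result object
local result = Vector3.New(0, 0, 0)
Vector3.AddTo(result, a, b)
Vector3.AddTo(result, result, c)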

12.Do not intersperse expensive or uncompiled operations.

Reduce the use of expensive operations and of operations that the JIT cannot compile.

This brings up a term from the LuaJIT documentation: NYI (Not Yet Implemented), meaning features whose compilation the author has not finished yet.

LuaJIT can compile almost all code into machine code, but not quite all of it. Besides the for ... in pairs case mentioned above, there are many NYI cases; the most common are:

for k, v in pairs(x): the pairs iterator itself is not JIT-compiled. Use ipairs instead where possible.

print(): not JIT-compiled; the author recommends io.write instead.

String concatenation: when logging it is tempting to write log("haha" .. x) and then avoid the cost by stubbing out log's implementation. Does stubbing it out actually help? Not at all, because the concatenation "haha" .. x is still executed. In 2.0.x such code is not JIT-compiled at all; 2.1.x finally supports it, but the redundant concatenation and memory allocation still happen. If you want the cost to disappear when logging is off, write log("haha %s", x) instead (see the sketch after this list).

table.insert: currently only insertion at the tail is JIT-compiled; inserting anywhere else falls back to C.
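As referenced in the string-concatenation item above, a hedged sketch of the format-string logging idea (log and LOG_ENABLED are illustrative names): formatting and allocation only happen when logging is actually enabled.

local LOG_ENABLED = false

local function log(fmt, ...)
  if LOG_ENABLED then
    io.write(string.format(fmt, ...), "\n")
  end
end

local x = 42
log("haha %s", x)   -- when LOG_ENABLED is false, no string is built at all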
