An in-depth analysis of xLua's GC optimization for value types under Unity

Reprinted from: http://gad.qq.com/article/detail/25645

 

Foreword

C# GC Alloc (hereinafter "GC") is a major problem under Unity. Once a dynamically typed language such as Lua is embedded, the interaction between the two easily generates garbage, so every Lua solution treats this as a focus of performance optimization. Put plainly, these optimizations are not complicated.

The culprit

Let's take a look at these two functions:

int inc1(int i)
{
    return i + 1;
}

object inc2(object o)
{
    return (int)o + 1;
}

 

Measured on Windows, inc1 is 20 times as fast as inc2!

Why is the gap so large? The main cause is the types of the parameter and the return value. The parameter of inc2 is of type object, which means a value type (such as an integer) must be boxed: a block of memory is allocated on the heap, and the type information and the value are copied into it. When the value is needed again it is unboxed, that is, copied from that heap memory back to the stack. After the function returns, the GC eventually detects that the heap memory is no longer referenced and releases it.
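As a rough way to reproduce the comparison, here is a minimal sketch that assumes a plain .NET console program rather than Unity. The class name, the iteration count, and the use of Stopwatch and GC.CollectionCount are illustrative choices and not part of the original measurement; the ratio you observe will depend on the runtime and platform.

```csharp
using System;
using System.Diagnostics;

public static class BoxingDemo
{
    public static int Inc1(int i) { return i + 1; }

    // The int argument is boxed at the call site, unboxed here,
    // and the int result is boxed again for the return.
    public static object Inc2(object o) { return (int)o + 1; }

    public static void Main()
    {
        const int N = 10000000; // arbitrary iteration count (assumption)
        int gen0Before = GC.CollectionCount(0);

        var sw = Stopwatch.StartNew();
        int a = 0;
        for (int i = 0; i < N; i++) a = Inc1(a);
        sw.Stop();
        Console.WriteLine("inc1: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        int b = 0;
        for (int i = 0; i < N; i++) b = (int)Inc2(b); // box, unbox, box, unbox
        sw.Stop();
        Console.WriteLine("inc2: " + sw.ElapsedMilliseconds + " ms");

        // Boxing shows up as extra generation-0 collections.
        Console.WriteLine("gen-0 collections: " + (GC.CollectionCount(0) - gen0Before));
    }
}
```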

The 20-fold gap is for a single parameter and a single return value; the more such parameters are added, the wider the gap becomes. Worse still, the GC is hard to control. In Unity mobile game projects, GC is often the cause of stuttering.

Whether current Lua solutions describe their work as GC optimization of the interaction between Lua and C# or as optimization of value types, they are in fact all doing one thing: avoiding the inc2 situation.

Avoiding inc2 when C# calls Lua

Lua is a dynamically typed language. Its functions accept any number of parameters of any type, and return any number of values of any type. If you access Lua functions through one generic interface, the situation is worse than inc2: to support any number of parameters of any type we may have to use variadic parameters, and to support multiple return values of any type the interface may have to return an object array rather than a single object. That adds two more arrays to allocate and free. The function prototype is roughly as follows:

object[] Call(params object[] args)
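To make the hidden allocations visible, the sketch below spells out roughly what the compiler does at a call site of such an interface. The Call method here is a hypothetical stand-in that merely adds one to its argument; it is not the implementation of any real Lua binding.

```csharp
using System;

public static class GenericCallDemo
{
    // Hypothetical stand-in for a generic Lua call interface.
    public static object[] Call(params object[] args)
    {
        // Unbox the argument, box the result, allocate the result array.
        return new object[] { (int)args[0] + 1 };
    }

    public static void Main()
    {
        // What the programmer writes:
        object[] r1 = Call(123456);

        // Roughly what the compiler emits for the same call:
        object boxedArg = 123456;                       // allocation 1: box the int
        object[] argArray = new object[] { boxedArg };  // allocation 2: params array
        object[] r2 = Call(argArray);                   // allocations 3 and 4 happen inside Call

        Console.WriteLine((int)r1[0]); // prints 123457
        Console.WriteLine((int)r2[0]); // prints 123457
    }
}
```

One int in and one int out costs four heap allocations in this sketch: two boxes and two arrays.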

For these reasons, although most solutions provide this interface (because it is convenient), using it is not recommended. Some solutions also provide a GC-free usage. With ulua, for example, avoiding GC looks like this:

var func = lua.GetFunction("inc");
func.BeginPCall();
func.Push(123456);
func.PCall();
int num = (int)func.CheckNumber();
func.EndPCall();

The idea is to expose Lua's stack-operation API: push the parameters one by one, make the call, then fetch the return values one by one. The interfaces for pushing and for fetching return values all take a specific type; in other words, they are inc1-style interfaces.

This is only the case of a single parameter and a single return value. In most cases the code will be more complicated.

As for slua, no comparable approach was found.

 

The core idea of xLua's solution is: as long as you tell me which parameters you will call with, I will do the optimization for you.

[CSharpCallLua]
public delegate int Inc(int i);
Inc func = luaenv.Global.Get<Inc>("inc");
int num = func(123456);

1. Declare a delegate that matches your needs and mark it with CSharpCallLua;

2. Run code generation;

3. Use the Get interface of the table to map the inc function to the func delegate;

4. From then on, simply use the delegate.

More complex parameter lists work the same way: declare, get, use. Compared with the Call interface that generates GC, there is only one extra declaration. It is as simple to use as Call, handling the return value is even simpler, and it also brings the benefit of strong type checking.

What if a lua function has multiple return values?

Multiple return values are mapped to the C# return value and output parameters, one by one from left to right.
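A minimal sketch of this mapping follows. It needs Unity with xLua and the code generation step to run, so treat it as untested illustration; the divmod function, the DivMod delegate, and the MultiReturnExample class are hypothetical names introduced here.

```csharp
using XLua;

// Hypothetical delegate: the first Lua return value becomes the C# return
// value, the second becomes the first out parameter.
[CSharpCallLua]
public delegate int DivMod(int a, int b, out int remainder);

public static class MultiReturnExample
{
    public static void Run(LuaEnv luaenv)
    {
        // Hypothetical Lua function with two return values.
        luaenv.DoString("function divmod(a, b) return math.floor(a / b), a % b end");

        DivMod divmod = luaenv.Global.Get<DivMod>("divmod");

        int remainder;
        int quotient = divmod(17, 5, out remainder);
        // quotient  <- first Lua return value
        // remainder <- second Lua return value
    }
}
```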

In addition, xLua supports mapping a Lua table to a C# interface. Property access on the interface accesses the corresponding fields of the Lua table, and member method calls invoke the corresponding functions in the Lua table. Again, no GC.
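The table-to-interface mapping might be used as in the sketch below. It requires Unity with xLua and generated code, so it is an untested illustration; the calc table, the ICalc interface, and the TableMappingExample class are hypothetical, and the assumption that the table is passed as the first argument (self) follows the pattern used in xLua's own examples.

```csharp
using XLua;

// Hypothetical interface describing the shape of a Lua table.
[CSharpCallLua]
public interface ICalc
{
    int factor { get; set; }   // reads and writes calc.factor
    int scale(int x);          // calls calc.scale with the table as self
}

public static class TableMappingExample
{
    public static int Run(LuaEnv luaenv)
    {
        luaenv.DoString(
            "calc = { factor = 2, scale = function(self, x) return self.factor * x end }");

        ICalc calc = luaenv.Global.Get<ICalc>("calc");
        calc.factor = 3;       // written through to the Lua table
        return calc.scale(5);  // evaluated on the Lua side
    }
}
```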

How is this done? It is not complicated. Taking the mapping of Lua functions to C# delegates as an example, xLua generates a piece of code for every delegate marked with CSharpCallLua. For Inc, the generated code looks like this:

public int SystemInt32(int x)
{
    //...init
    LuaAPI.lua_getref(L, _Reference);

    LuaAPI.xlua_pushinteger(L, x);
    int __gen_error = LuaAPI.lua_pcall(L, 1, 1, err_func);

    //...error handle
    int __gen_ret = LuaAPI.xlua_tointeger(L, err_func + 1);
    LuaAPI.lua_settop(L, err_func - 1);
    return __gen_ret;
}

 

The delegate returned by the Get method points to this method. Judging from the code, it is similar to ulua's GC-free code. The difference is that with ulua the code has to be written by hand, whereas xLua generates it; and because xLua has one layer less of wrapping and calls the Lua API directly, it should also be more efficient.

 

Complex value type optimization

Passing complex objects from C# to Lua

From .NET's point of view, the Lua virtual machine is unmanaged code. To pass an object across, several problems must be solved:

1. While Lua is using the object, the object must not be garbage collected;

2. If the unmanaged code (Lua) calls back into managed code (C#) and hands back the reference to the object, the corresponding object must be found correctly;

3. When the same object is passed repeatedly, the reference seen on the unmanaged side should preferably be the same;

The official solution to problems 1 and 2 is a pinned object. In measurements, pinning and releasing an object costs roughly the same as a Dictionary Set/Get. Problems 1 and 2 can instead be reduced to array operations, which perform 4 to 5 times better than the pinned solution: accept an object, find an empty slot in an array, store the object there, and return the array index as the object reference. If the empty slots are organized as a linked list, finding an empty slot is an O(1) operation, and finding an object by its reference is of course O(1) as well.

Problem 3 has no particularly good solution; a Dictionary is used to build an index from objects to references.
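The array-plus-free-list idea and the Dictionary reverse index can be sketched as follows. This is a minimal illustration under assumptions, not xLua's actual class: the names ObjectPool, Slot, and ReferenceComparer, the initial capacity, and the growth policy are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index-based object pool.
public class ObjectPool
{
    private const int EndOfList = -1;

    private struct Slot
    {
        public object Obj;  // held here so the GC keeps it alive (problem 1)
        public int Next;    // next free slot while this slot is empty
    }

    private Slot[] slots = new Slot[4];
    private int freeHead = EndOfList;
    private int used = 0;   // slots [0, used) have been handed out at least once

    // Reverse index so the same object always maps to the same reference (problem 3).
    private readonly Dictionary<object, int> reverse =
        new Dictionary<object, int>(ReferenceComparer.Instance);

    public int Add(object obj)
    {
        int index;
        if (reverse.TryGetValue(obj, out index)) return index;

        if (freeHead != EndOfList)
        {
            index = freeHead;               // O(1): pop the free list
            freeHead = slots[index].Next;
        }
        else
        {
            if (used == slots.Length) Array.Resize(ref slots, slots.Length * 2);
            index = used++;
        }
        slots[index].Obj = obj;
        reverse[obj] = index;
        return index;
    }

    // O(1) lookup by reference (problem 2).
    public object Get(int index) { return slots[index].Obj; }

    public void Remove(int index)
    {
        if (slots[index].Obj == null) return; // already free
        reverse.Remove(slots[index].Obj);
        slots[index].Obj = null;
        slots[index].Next = freeHead;       // O(1): push onto the free list
        freeHead = index;
    }
}

// Compares by identity, ignoring Equals/GetHashCode overrides.
public sealed class ReferenceComparer : IEqualityComparer<object>
{
    public static readonly ReferenceComparer Instance = new ReferenceComparer();

    public new bool Equals(object a, object b) { return ReferenceEquals(a, b); }

    public int GetHashCode(object o)
    {
        return System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode(o);
    }
}
```

The integer returned by Add is what would be handed to the unmanaged side; Remove would be called when that side no longer needs the object.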

 

The dilemma of complex value types

Everything in C# is an object, value types included, so the scheme above can be used for them as well. Functionally this works, but performance is disastrous:

Every time a value type is put into the object pool (the set of mechanisms from the previous section that solves the three problems), it hits the inc2 case: the value is boxed into a new object, and the whole sequence of pool-entry operations follows. Some will ask whether the pinned scheme avoids this problem. It does not. The value type lives on the stack, and pinning requires moving it from the stack to the heap, which involves a similar process: allocate heap memory, copy, and release when done.
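The sketch below shows why a value type always enters such a pool as a new object: each conversion to object produces a distinct boxed copy, so an identity-based lookup can never recognize it as already present. The Point3 struct is a hypothetical stand-in for a type like Vector3.

```csharp
using System;

// Hypothetical struct standing in for a Unity value type.
public struct Point3
{
    public float x, y, z;
}

public static class StructBoxingDemo
{
    public static void Main()
    {
        var p = new Point3 { x = 1f, y = 2f, z = 3f };

        object first = p;   // boxes: copies p into a new heap object
        object second = p;  // boxes again: a second, distinct heap object

        Console.WriteLine(ReferenceEquals(first, second)); // prints False
        Console.WriteLine(((Point3)first).x);              // prints 1
    }
}
```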

This problem has a wider impact than the previous one. It appears whenever C# passes a complex value type to Lua. For example, ordinary arithmetic on Vector3 generates a large amount of garbage.

ulua and slua take the same approach: hard-coded optimization for specific Unity value types (Vector2, Vector3, Vector4, Quaternion). Taking Vector3 as an example:

1. Reimplement all the methods of Vector3 in Lua;

2. Passing a C# Vector3 into Lua: build a Lua table on the Lua side, set the x, y, and z of the Vector3 into the corresponding fields, and set the table's metatable to the set of methods implemented in step 1;

3. Returning a Vector3 from Lua to C#: C# constructs a Vector3, then reads the x, y, and z fields of the table and assigns them to the Vector3;

 

xLua's complex value type optimization

The optimization above has some problems. Adding a new value type is very difficult, so the value types this approach can support can be counted on one hand, and user-defined structs cannot be supported at all. Deeply coupling these types into the core code is also unreasonable. And there is a more serious problem: the xLua author strongly dislikes this kind of hard-coding.

Let's think about it: what is the essence of the ulua and slua optimization that avoids GC? And why does passing a simple value type from C# to Lua generate no GC?

The answer is: value copy!

The complex value type optimization in ulua and slua, when passing from C# to Lua, essentially copies the Vector3 values into a Lua table, which avoids entering the pool and therefore avoids inc2. The same holds for simple value types: when a C# int is passed to Lua, the int value is likewise copied directly onto the Lua stack.

Once this is understood, many more options open up. xLua designs a new value type scheme that applies to any struct containing only value types. Structs may be nested, provided the nested structs also contain only value types.

The principle is not complicated:

1. Generate value copy code for the struct: a Pack function that copies the fields of the struct into a block of unmanaged memory, and an UnPack function that copies the value of each field back out of the unmanaged memory;

2. Passing a struct from C# to Lua: call the Lua API to allocate a userdata (unmanaged memory from C#'s point of view) and call Pack to pack the struct into it;

3. Returning from Lua to C#: call UnPack to unpack the struct;

4. The methods of the struct still use the original C# implementation;

Put plainly, this is similar to pb: serialize a C# data structure into a block of memory, and deserialize it back from that memory.
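The Pack/UnPack idea can be sketched in plain C# as below. This is an illustration under assumptions rather than xLua's generated code: Vec3 and Vec3Codec are hypothetical names, Marshal.AllocHGlobal stands in for the userdata block that Lua would allocate, and BitConverter.SingleToInt32Bits assumes a reasonably recent .NET runtime.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical struct containing only value types.
public struct Vec3
{
    public float x, y, z;
}

public static class Vec3Codec
{
    public const int Size = 12; // three 4-byte floats

    // Copy each field into unmanaged memory; no boxing, no managed allocation.
    public static void Pack(IntPtr buff, int offset, Vec3 v)
    {
        Marshal.WriteInt32(buff, offset, BitConverter.SingleToInt32Bits(v.x));
        Marshal.WriteInt32(buff, offset + 4, BitConverter.SingleToInt32Bits(v.y));
        Marshal.WriteInt32(buff, offset + 8, BitConverter.SingleToInt32Bits(v.z));
    }

    // Copy each field back out of unmanaged memory.
    public static void UnPack(IntPtr buff, int offset, out Vec3 v)
    {
        v = new Vec3
        {
            x = BitConverter.Int32BitsToSingle(Marshal.ReadInt32(buff, offset)),
            y = BitConverter.Int32BitsToSingle(Marshal.ReadInt32(buff, offset + 4)),
            z = BitConverter.Int32BitsToSingle(Marshal.ReadInt32(buff, offset + 8))
        };
    }

    public static void Main()
    {
        // Stand-in for the userdata block that Lua would own.
        IntPtr buff = Marshal.AllocHGlobal(Size);
        try
        {
            Pack(buff, 0, new Vec3 { x = 1f, y = 2f, z = 3f });

            Vec3 back;
            UnPack(buff, 0, out back);
            Console.WriteLine(back.x + ", " + back.y + ", " + back.z); // prints 1, 2, 3
        }
        finally
        {
            Marshal.FreeHGlobal(buff);
        }
    }
}
```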

Let's start with the disadvantage of this scheme:

The disadvantage stems from the fact that, under this scheme, calling a struct method still calls the original C# implementation. The call goes from Lua through C and then through P/Invoke to C#, and the cost of this adaptation is far greater than the cost of executing some simple methods. That said, calling the C# implementation is only xLua's default and is not mandatory. xLua provides an API for reading and modifying struct fields directly from C without going through C#. Diligent readers can use this API to reimplement performance-critical methods in Lua, which avoids the adaptation cost between Lua and C#.

PS: A widely circulated performance test case for Lua solutions uses Vector3.Normalize to measure the performance of Lua calling C# static functions; even the official evaluation published by Unity uses it. From the preceding analysis we know this is wrong. In the solutions tested, Vector3.Normalize runs entirely in Lua, so "Lua calls C# static function" is not tested at all.

Advantages of this scheme:

1. A much wider range of struct types is supported, and what the user has to do is very simple: declare the type for code generation (GCOptimize). The declaration is required in order to avoid generating too much code;

2. Compared with the table scheme it saves memory: only the size of the struct plus a header is needed, whereas an empty table on 64-bit takes 80+ bytes. The actual memory usage of the userdata scheme for Vector3 is one third of that of the table scheme;
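A declaration along the following lines is what the first point refers to. The struct names are hypothetical, and marking the nested struct as well is an assumption modeled on how the 05_NoGc example declares its types; the snippet only compiles inside a Unity project that includes xLua.

```csharp
using XLua;

// Hypothetical nested struct: contains only value types.
[GCOptimize]
public struct Padding
{
    public byte c;
}

// Hypothetical user-defined struct: value-type fields plus a nested struct.
[GCOptimize]
public struct HitInfo
{
    public int id;
    public float damage;
    public Padding pad;
}
```

After code generation, passing a HitInfo between C# and Lua should go through the generated Pack/UnPack path instead of the object pool.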

 

Other value type GC optimizations

Most of the following optimizations are available only in xLua. Their usage is shown in its 05_NoGc example. After generating the code, run it under the profiler to see the effect.

1. Enumeration types are passed without GC;

2. decimal is passed without loss of precision and without GC;

3. For every type that is passed without GC, array access of that type is also GC-free; most solutions seem to have done this;

4. A struct that can be optimized with GCOptimize can be passed directly as a Lua table with the corresponding structure, without GC;

5. LuaTable provides a set of generic Get/Set interfaces that can pass value types without GC;

6. Once an interface is marked with CSharpCallLua, a table can be used to implement it, and accessing the table through the interface generates no GC;

These optimizations follow the same line as the two main ideas introduced earlier. Their implementation can be read in the source code, so they are not analyzed here.
