Baidu APP iOS Package Size 50M Optimization Practice (6): Cleaning Up Useless Methods

1. Introduction

After a first phase of optimization, including cleaning up unused resources, removing unused classes, and Xcode compilation-related optimizations, the package size of Baidu APP has been significantly reduced. Even so, the APP still occupies 350M of space on an iPhone 11. As Baidu's flagship APP, Baidu APP also has many fast-moving business iterations, so reducing package size and preventing regression remain core tasks at this stage. We therefore started cleaning up useless methods, which is finer-grained work and carries a higher risk when fixing. The goal is to effectively reduce the package size of Baidu APP while deleting useless methods and redundant code from the project and keeping the code clean.

2. Survey of existing solutions

To clean up useless methods, we surveyed the solutions that various companies have published. The mainstream approach analyzes the Mach-O and LinkMap files, but it has the following main problems:

1. Low accuracy.

2. System methods have to be filtered out manually.

3. Calls related to load, initialize, and properties cannot be recognized.

4. Calls made through string reflection, Target-Action registration, Observer registration, and similar mechanisms cannot be recognized.

5. Complex syntax scenarios cannot be recognized, such as method calls along an inheritance chain or subclasses implementing parent class methods.

6. System notifications and similar scenarios cannot be recognized.

Because the published solutions have these shortcomings, and because taking code offline is highly sensitive, the business teams involved are very cautious. Identification accuracy directly affects how willing those teams are to remove useless code, so it is critical to the whole cleanup effort. For this reason, we abandoned the Mach-O + LinkMap approach.

3. Solution selection

Looking at the shortcomings listed in Section 2, the core cause of the low accuracy is that analyzing the build product cannot provide all the required information, or at least no effective means of obtaining it has been found. The best way to solve these problems is to obtain as much code information as possible. Since the information cannot be recovered from the build product, we look for it at the origin, that is, at the source code level.

The source code certainly contains all the information, but how should it be analyzed? There are three main options:

  • Analyze the source code directly with scripts

To analyze source code effectively this way, every grammar rule of the language has to be matched, which amounts to writing a source code parser, so this option was abandoned.

  • Analyze the AST (Abstract Syntax Tree) directly with scripts

The abstract syntax tree (AST) generated during compilation contains all the required information, and clang provides a command line option for dumping AST data directly. However, the clang command works on one class at a time, so relationships between classes, such as inheritance or the relationship between a category and its main class, are difficult to obtain. This option was also abandoned.

  • Analyze the AST with a self-built compilation tool based on libtooling and the Swift Compiler (the Swift part will be covered in the next article)

Since the AST dumped by the clang command cannot meet our needs, we hook into the compilation process itself and obtain the required information while the compiler builds the AST internally. This is the option we adopted: a self-built compilation tool based on libtooling and the Swift Compiler analyzes the AST and obtains all the required information.
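For reference, the per-file AST dump mentioned in the second option can be produced with clang's -fsyntax-only and -Xclang -ast-dump=json flags. The sketch below only assembles such a command line; the helper name ast_dump_command and the file name Foo.m are hypothetical, and a real project would need further flags (SDK path, header search paths, and so on) that are not shown here.

```python
from typing import List, Sequence


def ast_dump_command(source_file: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Assemble a clang invocation that prints the AST of one file as JSON."""
    return [
        "clang",
        "-fsyntax-only",               # parse and type-check only, emit no object file
        "-Xclang", "-ast-dump=json",   # forward the AST dump flag to the clang frontend
        *extra_args,
        source_file,
    ]


if __name__ == "__main__":
    # Hypothetical input file; each invocation covers a single translation unit.
    print(" ".join(ast_dump_command("Foo.m", ["-fobjc-arc"])))
```

Each invocation describes one translation unit only, which is why stitching together cross-class relationships from such dumps is awkward.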

4. Solution design

As mentioned above, Baidu APP adopted static analysis based on libtooling and the Swift Compiler. The following sections explain it from the perspectives of principle and implementation.

4.1 Introduction to the compilation process

4.1.1 Overall structure of Xcode compilation

This section briefly covers the structure of the compiler, the compilation process, and what static analysis is.

△Figure 4-1

As shown in Figure 4-1, LLVM uses a three-phase design: the compiler front-end (Frontend), the optimizer, and the compiler back-end (Backend). Figure 4-2 shows how these three phases map onto Xcode:

△Figure 4-2

In day-to-day builds, Xcode invokes two compiler front-ends, Clang and Swift. Both produce a common intermediate product, from which the LLVM back-end generates the object files.

The Xcode build log shows that clang compiles Objective-C, C, and C++, with different compilation parameters selecting among the three languages:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang

Swift files are compiled with the Swift compiler:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swift-frontend

Both executables can be found inside the Xcode app bundle and invoked from the command line, and the --help option lists the compilation parameters and functions they support. The compilers bundled with Xcode are Apple-customized builds of open source LLVM and Swift, so they differ from the open source versions in some respects.

4.1.2 Clang and Swift compilation process

Figure 4-3 below shows the front-end compilation processes of Clang and Swift. The Swift process has an additional SIL stage (there is also a SIL Guaranteed Transformations step), but SIL is not the focus here. As Figure 4-3 shows, both the Clang and Swift compilers generate an AST, the AST contains most of the information we need, and both compilers expose interfaces for reading it. The remaining work therefore comes down to four points:

1. Build the compilation tool project and make sure it runs correctly.

2. Obtain the AST and extract the required data according to the syntax features of Objective-C, C, or C++.

3. Perform business analysis and processing on the extracted data.

4. Adapt some compilation-related details, because the open source LLVM differs from the version Xcode actually uses.

△Figure 4-3

4.2 Overall scheme design

Using a programming language involves two levels, as shown in Figure 4-4: declaration and call. Classes, protocols, properties, methods, functions, and so on are declared, and whatever is declared exists to be used, so every declaration can be called, whether internally or publicly. Technically, taking all declarations and subtracting the declarations that are called leaves the declarations that are never called, which are the useless methods we are looking for. The technical result still needs a business judgment, because some methods are basic capabilities exposed to external callers, and whether to delete them needs further discussion. This article focuses on the technical side.
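The "declarations minus calls" idea can be illustrated with a plain set difference. This is only a minimal sketch: the function name find_unused and the identifiers in the example are hypothetical, and the real analysis works on the richer records described later in this article.

```python
from typing import Iterable, Set


def find_unused(declared: Iterable[str], called: Iterable[str]) -> Set[str]:
    """Return the declared identifiers that never appear as a call target."""
    return set(declared) - set(called)


if __name__ == "__main__":
    # Hypothetical identifiers standing in for extracted declarations and calls.
    declared = ["FeedCell.reloadData", "FeedCell.resetState", "FeedCell.legacyLayout"]
    called = ["FeedCell.reloadData", "FeedCell.resetState"]
    print(sorted(find_unused(declared, called)))
```

The result is only a candidate list; as noted above, whether a candidate can really be deleted is a business decision.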

△Figure 4-4

In the clang source code, declarations and calls correspond to the base classes Decl and Expr respectively. The overall technical solution is shown in Figure 4-5 below. The useless method analysis is divided into four layers:

1. Basic layer: assembles the compilation parameters required by the compilation tool and performs syntax rule matching.

2. Transformer layer: converts the data matched by the syntax rules into a general data format.

3. General data layer: stores the data produced by the Transformer layer by category. The stored data covers all the data in the code, such as properties, methods, and protocols.

4. Business application layer: performs business analysis on the data stored by the general data layer.

△Figure 4-5

4.3 Detailed solution implementation

4.3.1 Objective-C compilation tool construction

The compilation tool takes the form of an executable, similar to the clang that ships with Xcode, shown in the red box in Figure 4-6.

/Users/UserName/Documents/XcodeEdition/Xcode14.2/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang

△Figure 4-6

Simply put, the compilation tool built from source has part of the functionality of Xcode's clang. It analyzes the AST objects generated during compilation and obtains all the syntax information we need about the programming language.

4.3.1.1 LLVM source code construction

Building the compilation tool requires the static or dynamic libraries provided by LLVM, which are obtained by building the LLVM source code yourself. The LLVM source code is available on GitHub. Once there, it may be unclear which branch or tag to build, that is, which version corresponds to the clang used by Xcode. The current Xcode versions are 14.2 and 14.3, and clang --version shows that Xcode uses clang 14, so we built release/14.x (no official mapping was found; this is an inference). Running clang --version on the resulting build shows that the minor version number of the open source clang differs from Xcode's. This is because Xcode's clang is customized by Apple on top of the open source code, which is also visible in the number of libraries and header files that clang depends on inside Xcode. The build log also shows that Xcode's clang accepts some parameters that the open source clang does not. Although Apple has made some customizations, their overall impact is limited, so there is no need to worry much about matching minor version numbers. (Preliminary verification shows that building the latest release/16.x, clang 16, also works.)

△Figure 4-7

There are two main ways to build: with Ninja or with Xcode. Choose the Xcode generator if you need to debug the source code in Xcode. The static libraries that are finally integrated into the compilation tool should be built in Release mode, which keeps the tool as small as possible and also suppresses some warnings. You can follow the Getting Started guide in the LLVM repository, assemble the commands yourself, or use the following commands:

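Build process
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
# The build folder name is arbitrary. Separate folders can be created for
# different build targets, for example mkdir ninjaBuild or mkdir xcodeBuild.
mkdir build
cd build   # or cd xcodeBuild
# LLVM_ENABLE_PROJECTS="clang" is added here so that clang and its libraries are built too.
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang" ../llvm
cmake --build .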
To build the Xcode version, replace Ninja with Xcode in the -G option.

4.3.1.2 Project construction

LLVM provides two libraries, libclang and libtooling. Baidu APP uses libtooling. They compare as follows:

  • libclang (based on online information, not tested by us):

    1. Provides a stable C interface that can traverse the syntax tree, obtain tokens, and perform code completion.

    2. The interface is stable, so clang version updates have little impact on it.

    3. libclang cannot obtain all the information in the AST.

  • libtooling (tested by us):

    1. Provides a C++ interface. The resulting tool does not depend on the compiler and can be used as a standalone command.

    2. The interface is unstable, so the AST-related code has to be updated whenever the dependent libraries are upgraded.

    3. libtooling can obtain all the information in the AST.

We finally chose libtooling, mainly because it can obtain all the information in the AST and can run independently of Xcode. Setting up the project itself is not complicated and is mostly a matter of API usage; refer to the official libtooling documentation.

△Figure 4-8

The overall code flow is shown in Figure 4-8. It has five core parts:

  • Parse the parameters

  • Create the ClangTool; see Tooling.h, line 309, in the ClangTooling part of the LLVM source code

  • Create the ASTFrontendAction used to obtain AST data, create the ASTConsumer, and bind the ASTMatcher

  • Match each syntax rule with the ASTMatcher matchers

  • Filter the data and perform business processing based on the matched data

4.3.1.3 Data storage structure design

The data is stored in JSON format. The following is an example of the basic data format, which can be extended as needed:

"objc(protocol or class)@ClassName(class method or instance method)@methodName": {
    "identifier": "objc(protocol or class)@ClassName(class method or instance method)@methodName",
    "isInstance": true,
    "kind": 16,
    "location": {
        "col": 36,
        "filename": "file name",
        "line": 147
    },
    "name": "method name",
    "paramters": "parameters",
    "returnType": "return type",
    "sourceCode": "source code"
}

{
    "declaration": {
        "identifier": "objc(protocol or class)@ClassName(class method or instance method)@methodName",
        "isInstance": true,
        "kind": 16,
        "location": {
            "col": <column number>,
            "filename": "name of the class containing the declaration",
            "line": <line number>
        },
        "name": "method name",
        "paramters": "parameter names",
        "returnType": "return type",
        "sourceCode": "source code"
    },
    "kind": 1,
    "location": {
        "col": 5,
        "filename": "name of the current file",
        "line": 15
    }
}
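To make the two record shapes above concrete, here is a minimal sketch that builds a declaration record and a call record referencing it. The helper names, the class and file names, and the exact tags inside the identifier ("class", "instance method") are assumptions for illustration; the article does not spell out the real encoding or the meaning of the numeric kind values, which are simply copied from the example.

```python
import json
from typing import Any, Dict


def make_identifier(container_kind: str, container: str, is_instance: bool, method: str) -> str:
    """Build an identifier in the 'objc(...)@Class(...)@method' shape (tags are assumed)."""
    scope = "instance method" if is_instance else "class method"
    return f"objc({container_kind})@{container}({scope})@{method}"


def make_declaration(identifier: str, is_instance: bool, kind: int,
                     filename: str, line: int, col: int, name: str) -> Dict[str, Any]:
    """Declaration record; optional fields such as sourceCode are omitted here."""
    return {
        "identifier": identifier,
        "isInstance": is_instance,
        "kind": kind,
        "location": {"col": col, "filename": filename, "line": line},
        "name": name,
    }


def make_call(declaration: Dict[str, Any], kind: int,
              filename: str, line: int, col: int) -> Dict[str, Any]:
    """Call record: embeds the declaration being called plus the call site."""
    return {
        "declaration": declaration,
        "kind": kind,
        "location": {"col": col, "filename": filename, "line": line},
    }


if __name__ == "__main__":
    ident = make_identifier("class", "FeedCell", True, "reloadData")
    decl = make_declaration(ident, True, 16, "FeedCell.m", 147, 36, "reloadData")
    call = make_call(decl, 1, "FeedViewController.m", 15, 5)
    print(json.dumps({ident: decl}, indent=4))
    print(json.dumps(call, indent=4))
```

Keying declarations by identifier and embedding the same identifier in every call record is what makes the later "declared minus called" comparison a simple lookup.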

5. Problems encountered and solutions

1. Identifying property calls

After compilation, an Objective-C property corresponds to two methods, a getter and a setter, plus an ivar. A caller may use only the getter, only the setter, or only the ivar. As long as any one of them is called, the property counts as called and is not a useless method, so the other two must be removed from the results.
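The property rule can be sketched as follows, assuming the default Objective-C naming conventions (getter name, setName: setter, _name ivar). The function names are hypothetical, custom getter= or setter= attributes are ignored, and names are not scoped per class, which a real implementation would need.

```python
from typing import Iterable, Set


def accessor_names(prop: str) -> Set[str]:
    """Default getter, setter, and ivar names for a property called `prop`."""
    setter = "set" + prop[0].upper() + prop[1:] + ":"
    return {prop, setter, "_" + prop}


def drop_used_properties(unused: Set[str], properties: Iterable[str], used: Set[str]) -> Set[str]:
    """Remove all accessors of a property from `unused` if any one of them is used."""
    result = set(unused)
    for prop in properties:
        names = accessor_names(prop)
        if names & used:      # the getter, the setter, or the ivar is referenced
            result -= names   # so none of them is reported as useless
    return result


if __name__ == "__main__":
    unused = {"title", "count", "setCount:"}
    used = {"setTitle:"}      # only the setter of `title` is ever called
    print(sorted(drop_used_properties(unused, ["title", "count"], used)))
```

In this example the getter title is dropped from the candidates because its setter is used, while count and setCount: stay because nothing references that property.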

2. Header files must also be examined when extracting method bodies

Methods are not necessarily implemented only in .m files. For example, C++ header files can contain method implementations, and it is also syntactically valid for Objective-C .h files to contain implementations. Therefore, when extracting methods, examine the header files as well as the implementation files.

3. Handling inheritance

For scenarios such as a subclass implementing a parent class method, every method is traced back to its parent class during identification, and the parent class name is used as the class name part of the identifier in the data structure above. This way every method is matched to the class that declares it.
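A minimal sketch of this trace-back, assuming the class hierarchy and per-class method sets have already been extracted into dictionaries. The function name declaring_class and the class names in the example are hypothetical.

```python
from typing import Dict, Optional, Set


def declaring_class(cls: str, selector: str,
                    superclass: Dict[str, Optional[str]],
                    methods: Dict[str, Set[str]]) -> str:
    """Return the topmost class in cls's inheritance chain that declares `selector`."""
    owner = cls
    current: Optional[str] = cls
    while current is not None:
        if selector in methods.get(current, set()):
            owner = current                   # keep walking: an ancestor may declare it too
        current = superclass.get(current)     # None when the chain ends or is unknown
    return owner


if __name__ == "__main__":
    superclass = {"DetailCell": "BaseCell", "BaseCell": None}
    methods = {"BaseCell": {"configure:"}, "DetailCell": {"configure:", "reload"}}
    print(declaring_class("DetailCell", "configure:", superclass, methods))
    print(declaring_class("DetailCell", "reload", superclass, methods))
```

With the owner resolved this way, a call made through a BaseCell reference and an override in DetailCell end up under the same identifier.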

4. Filtering system method calls

LLVM provides an interface for determining whether the current method belongs to a system class.

5. Filtering system methods implemented by business classes

For every method in the current class, trace back through the parent classes in its inheritance chain and determine whether the method is a system method. If it is, filter it out directly.
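The filter for overridden system methods can be sketched on top of the same kind of extracted hierarchy. The set of system classes is taken as given here; the article does not say which interface was used to determine it (Clang's SourceManager::isInSystemHeader is one way to make such a check). The function name and HomeViewController are hypothetical.

```python
from typing import Dict, Optional, Set


def overrides_system_method(cls: str, selector: str,
                            superclass: Dict[str, Optional[str]],
                            methods: Dict[str, Set[str]],
                            system_classes: Set[str]) -> bool:
    """True if `selector` is declared by a system class in cls's inheritance chain."""
    current: Optional[str] = cls
    while current is not None:
        if current in system_classes and selector in methods.get(current, set()):
            return True
        current = superclass.get(current)
    return False


if __name__ == "__main__":
    superclass = {"HomeViewController": "UIViewController", "UIViewController": None}
    methods = {
        "UIViewController": {"viewDidLoad"},
        "HomeViewController": {"viewDidLoad", "refreshFeed"},
    }
    system = {"UIViewController"}
    # viewDidLoad is invoked by the system, so it must not be reported as useless.
    print(overrides_system_method("HomeViewController", "viewDidLoad", superclass, methods, system))
    print(overrides_system_method("HomeViewController", "refreshFeed", superclass, methods, system))
```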

6. There is currently no effective way to identify whether an implementation of a protocol method is called. The current solution is to filter out protocol methods directly: all protocol methods are treated as called.

When extracting a method, determine which protocols the current interface conforms to, traverse the methods of those protocols, and check whether the method is a protocol method. If so, mark it as called.

7. Subclasses implementing protocols adopted by a parent class

Trace back the inheritance chain of the current class, collect the protocols adopted along the chain, and check whether the method belongs to one of them.
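Items 6 and 7 together amount to the check sketched below: a method counts as a protocol method if any class in the inheritance chain adopts a protocol that declares it. The dictionaries, the function name, and the class and protocol names are hypothetical, and protocols that inherit from other protocols are not handled in this sketch.

```python
from typing import Dict, Optional, Set


def is_protocol_method(cls: str, selector: str,
                       superclass: Dict[str, Optional[str]],
                       adopted: Dict[str, Set[str]],
                       protocol_methods: Dict[str, Set[str]]) -> bool:
    """True if cls or any ancestor adopts a protocol that declares `selector`."""
    current: Optional[str] = cls
    while current is not None:
        for proto in adopted.get(current, set()):
            if selector in protocol_methods.get(proto, set()):
                return True
        current = superclass.get(current)
    return False


if __name__ == "__main__":
    superclass = {"FeedCell": "BaseCell", "BaseCell": None}
    adopted = {"BaseCell": {"Reloadable"}}            # only the parent declares conformance
    protocol_methods = {"Reloadable": {"reloadData"}}
    print(is_protocol_method("FeedCell", "reloadData", superclass, adopted, protocol_methods))
    print(is_protocol_method("FeedCell", "layout", superclass, adopted, protocol_methods))
```

Methods for which this returns true are marked as called and excluded from the useless method candidates.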

8. Normally, a class that implements a protocol should declare the conformance explicitly, for example by listing the protocol in angle brackets in its @interface declaration. In practice, a lot of code implements protocol methods without declaring conformance, which affects the identification of protocol methods; the solutions in items 6 and 7, for example, both fail in this case.

If a component has only a few such cases, push the owners to fix them by declaring the conformance explicitly. If a component has many such cases and they cannot all be fixed in the short term, a temporary adaptation is needed: collect all methods of the protocols declared in the component, and take the difference between all declarations extracted from the component and those protocol methods. This may exclude some methods by mistake, but the remaining results can be trusted. (The component is only one dimension. The same processing can be applied to associated components, because the protocols a component implements are not necessarily declared inside it; this requires the component's dependency information.)
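The temporary adaptation can be sketched as a selector-level set difference over one component. The function name, the (class, selector) representation, and the example names are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def filter_component(declared: Set[Tuple[str, str]],
                     protocol_methods: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Drop every (class, selector) whose selector matches any protocol method in the component."""
    all_protocol_selectors: Set[str] = set().union(*protocol_methods.values())
    return {(cls, sel) for (cls, sel) in declared if sel not in all_protocol_selectors}


if __name__ == "__main__":
    declared = {
        ("FeedCell", "reloadData"),      # implements Reloadable without declaring conformance
        ("FeedCell", "legacyLayout"),
        ("Banner", "reloadData"),        # unrelated method that happens to share the selector
    }
    protocol_methods = {"Reloadable": {"reloadData"}}
    print(sorted(filter_component(declared, protocol_methods)))
```

The Banner entry shows the accidental exclusion mentioned above: a method that merely shares a selector with a protocol method is dropped too, so some truly useless methods may be missed, but whatever remains is not a protocol implementation.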

There are many more edge cases in useless method detection; the ones above are listed for reference.

6. Summary

This technology has actually been used in Baidu APP for a long time: the author was previously responsible for interface change review, component integrity verification, and privacy compliance call chain analysis for Baidu APP, all of which rely on it. Useless method identification is a functional extension that came to mind while working on package size optimization. Compared with those earlier uses, useless method detection has to handle finer details and many more cases. Subsequent articles will cover Swift useless method analysis, interface change review, component integrity verification, and privacy compliance call chain analysis one by one.

References:

[1] libclang: https://clang.llvm.org/doxygen/group__CINDEX.html

[2] libtooling official documentation: https://clang.llvm.org/docs/LibTooling.html

[3] LLVM source code: https://github.com/llvm/llvm-project

Origin my.oschina.net/u/4939618/blog/10112383