What exactly is clang? What exactly is the difference between gcc and clang?

Recently I realized that I was not clear about the difference between GNU GCC and Clang, which was getting in the way of some implementation work and study, so I spent the past two days looking into it carefully.

During this research I found that many of the problems actually stem from language (not programming languages, but distortions introduced when translating between Chinese and English) and from how the concepts are understood.

If you search online for clang, some people will tell you that it is a front end, quote a few compiler introductions from a textbook, and then list a bunch of comparison tables, without explaining the underlying principles and mechanisms in any detail. That only raises more questions:

  • Why is clang a front end? Isn't it a complete compiler? If clang is a complete compiler, why is it called a front end? If it is not complete, what is the back end?
  • What exactly is the definition of a compiler? The definition of a compiler in textbooks seems to differ from what gcc actually is.

A note on naming: in this article, gcc refers to the command you can run directly on Ubuntu and other Linux distributions (from the GNU project). When I mean the project, I write "GNU GCC". llvm-gcc is never abbreviated to gcc.

This article answers these questions step by step. Along the way it should not only clarify what clang is, but also teach you more about gcc, the compilation process, compilers, and LLVM.

A modern "compiler" such as gcc is a collection of tools

One thing needs to be understood first, since it answers most of the questions above and is also the source of most of the misunderstanding: a modern "compiler" such as gcc is a collection of tools. It contains a preprocessor and a compiler, and it calls other tools such as the assembler and the linker or loader. It is not a single compiler (this clash between the term and the thing it names is one of the main reasons people are misled).

I state the answer first so that readers can go through the explanation with the answer in mind, which should make it easier to follow.

What exactly is a compiler (or what is the compilation process)

As mentioned above, "compiler" is a term that is often used in conflicting ways. In many talks, blogs, textbooks, and professional books, a compiler is described as "a program that converts source code into an executable program" (for example, the "compiler" gcc can turn source code directly into an executable). This describes succinctly and accurately what happens when you run the gcc command, but it is not the definition of a compiler.

Let's look at how the most classic book on the subject, "Compilation Principles" (the Dragon Book, published in English as "Compilers: Principles, Techniques, and Tools"), introduces compilers. This is also the most classic meaning of the word:

In simple terms, a compiler is a program that reads a program in one language (the source language) and translates it into an equivalent program in another language (the target language). One of the most important roles of the compiler is to report errors it finds in the source program during translation.
If the target program is in machine language, then the target program is an executable program.

Going back to the definition in "Compilation Principles": the source code we write is converted by the compiler into code in another language, and if the converted code is in machine language, the target code is an executable program. If there are multiple source files or externally linked libraries, however, the output may be a shared object instead.

"Code" refers to a bunch of digits, letters, and symbols in English, and is rendered in Chinese as "代码" or "码". In essence, an application is a bunch of binary files written in machine language.

That is, a compiler is really a program that converts one language into another.

However, in the standard compilation process of recent decades, "compiler" refers to the program that converts files such as .c files into .s files. For convenience, unless otherwise specified, "compiler" below follows this definition.

Under this definition, the internal workflow of the compiler is roughly as follows:

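  Source code (C, C++, Objective-C)
    -> language front end (one each for C, C++, Objective-C)
    -> optimizer (shared by all front ends)
    -> back end (one per target: x86, ARM)
    -> assembly code (x86, ARM)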

The compiler generates assembly code for the specified platform. The assembler then converts the assembly code into machine language, and finally the linker links it into an executable program.

In addition, there are a few points to add:

  1. The source code in each language here has already been preprocessed;
  2. The front end of a language generally refers to the lexical analyzer (lexer) and the syntax analyzer (parser). The front end converts the source code step by step (from high level to low level) into the intermediate representation (IR) required by the optimizer, and this work is done by several analyzers;
  3. An AST (abstract syntax tree) is usually shown as a separate item in front of the optimizer. It is a high-level intermediate representation that is essentially a restructured form of the source code;
  4. The optimizer is sometimes called the middle end. Besides improving performance, as a middle end it keeps the front ends and back ends better separated, which makes cross-compilation easier.
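
The split into front end, optimizer, and back end can be sketched with a toy compiler. This is only an illustration under loud assumptions: the "source language" is arithmetic expressions, Python's own `ast` parser stands in for a real lexer and parser, the IR format is made up, and the two back ends emit pseudo-assembly on virtual registers (x86-flavoured and ARM-flavoured, not valid for either). All function names are hypothetical.

```python
import ast
from itertools import count

OPS = {ast.Add: "add", ast.Mult: "mul"}

def front_end(source):
    """Front end: turn source text into an AST (Python's parser stands in for lexer + parser)."""
    return ast.parse(source, mode="eval").body

def lower(node, ir, temps):
    """Lower the AST to three-address IR; return the operand that holds the node's value."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = lower(node.left, ir, temps)
        right = lower(node.right, ir, temps)
        dest = f"t{next(temps)}"
        ir.append((dest, OPS[type(node.op)], left, right))
        return dest
    raise SyntaxError("unsupported construct in this toy language")

def optimize(ir):
    """Middle end: constant folding, independent of source language and target."""
    known, out = {}, []
    for dest, op, a, b in ir:
        a, b = known.get(a, a), known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = a + b if op == "add" else a * b
        else:
            out.append((dest, op, a, b))
    return out

def x86_back_end(ir):
    """Back end 1: two-operand pseudo-assembly."""
    lines = []
    for dest, op, a, b in ir:
        mnemonic = {"add": "add", "mul": "imul"}[op]
        lines += [f"mov {dest}, {a}", f"{mnemonic} {dest}, {b}"]
    return lines

def arm_back_end(ir):
    """Back end 2: three-operand pseudo-assembly."""
    return [f"{op} {dest}, {a}, {b}" for dest, op, a, b in ir]

def compile_expr(source, back_end):
    """Driver: front end -> IR -> optimizer -> chosen back end."""
    ir = []
    lower(front_end(source), ir, count())
    return back_end(optimize(ir))

if __name__ == "__main__":
    for back_end in (x86_back_end, arm_back_end):
        print(back_end.__name__, compile_expr("x * (2 + 3)", back_end))
```

Both back ends receive the same optimized IR, so the folding of 2 + 3 is written once and benefits every target. Supporting another target in this sketch means adding one more back-end function and nothing else.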

The path from source code to a high-level intermediate representation, and then from intermediate to low-level representations, is roughly as follows:

[Figure: flow from source code to high-level intermediate representations, and from intermediate to low-level]

Here is an article that introduces it in more detail: "Intermediate Representation"

The process of converting source code into an executable program

The complete process of converting source code into an executable program is what we usually call the "compilation process". Over the past few decades, the standard process has been roughly as follows:

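  source code
    -> [preprocessor] -> modified source code
    -> [compiler] -> assembly code
    -> [assembler] -> relocatable machine code
    -> [linker or loader] -> executable program

(Names in square brackets are tools; the other items are the code passed between them.)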

As you can see, going from source code to an executable program involves the preprocessor, the compiler, the assembler, and the linker or loader, and the compiler is only responsible for converting the source code into the corresponding assembly code.

Demonstration: gcc and its companion tools cpp, as, and ld

Apart from the compiler and the assembler, the other tools mentioned above are probably much less familiar. The following uses gcc, the most classic choice for C, to introduce this process: the preprocessor that ships with gcc is cpp, the assembler is called as, and the linker is called ld.
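
To make the four stages concrete, here is a minimal Python sketch that drives gcc one stage at a time using its standard -E, -S, and -c options. The file name hello.c and both function names are hypothetical, and the script assumes gcc is on PATH. The link step goes through the gcc driver rather than calling ld directly, because the driver supplies the startup files and libraries that ld needs.

```python
import shutil
import subprocess
from pathlib import Path

def gcc_stage_commands(source):
    """Return one gcc command per stage for a C source file such as hello.c."""
    stem = str(Path(source).with_suffix(""))
    return [
        ["gcc", "-E", source, "-o", f"{stem}.i"],       # preprocess only (the job of cpp)
        ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"],  # compile to assembly, then stop
        ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"],  # assemble only (the job of as)
        ["gcc", f"{stem}.o", "-o", stem],               # link (the job of ld)
    ]

def run_stages(source):
    """Run the stages in order; check=True stops at the first failing stage."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc was not found on PATH")
    for command in gcc_stage_commands(source):
        print(" ".join(command))
        subprocess.run(command, check=True)
```

Calling run_stages("hello.c") should leave hello.i, hello.s, hello.o, and the executable hello next to the source, so each intermediate file can be inspected.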

For an introduction to these three tools and a detailed walk through each step of the compilation process, see my other article "Use gcc to show the complete compilation process", which also covers some gcc usage. I strongly recommend reading it after this one; otherwise you may only understand the material at a literal level. That content was originally meant to go here, but it would have pushed this article to about 20,000 words, which would take too long to read.

gcc internal workflow

The internal workflow of gcc is as follows, and the preprocessing process is ignored here:

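  Source code (C, C++, Objective-C)
    -> language front end (one each for C, C++, Objective-C)
    -> AST
    -> optimizer
    -> RTL (Register Transfer Language, gcc's lowest-level intermediate representation)
    -> back end (one per target: x86, ARM)
    -> assembly code (x86, ARM)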

Workflow inside Clang

As time went on, the old-fashioned compilation process was no longer enough:

  1. Performance optimization takes too much manpower and resources (today's assembly languages are far more complicated than before: the classic PDP-11 manual has fewer than 30 pages on instructions, while the Intel x86 instruction manual now runs to about 2,500 pages);
  2. The development cost per machine is high (for example, compiling the same program for ARM and x86);
  3. Compilers lack "plugins" (sometimes new optimizations or processing steps are needed).

At this point it becomes clear that "front end" here means the front end of the whole compilation process for the C language family, not the front end of a compiler. So clang is a full compiler that converts a .c file into a .s file, but it calls the assembler and the linker to produce the final executable.

As a compiler, clang converts the C-family source you wrote into LLVM IR (a low-level language) and then into a .s file. It then calls the assembler in the LLVM project (or another assembler) to assemble that into a .o object file (the "assembly stage" mentioned above), and finally calls the linker to link everything into an executable program.

That is, the internal compiler workflow described earlier becomes the following (preprocessing is again omitted):

  Source code (C, C++, Objective-C)
    -> language front end (one each for C, C++, Objective-C)
    -> AST
    -> optimizer
    -> LLVM IR
    -> back end (llc)
    -> assembly code (x86, ARM)

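The assembler and linker that clang uses afterwards can be either the ones integrated in LLVM or the GNU ones. For example, the linker can be LLVM's lld, GNU's ld or gold, or MSVC's link.exe. The LLVM integrated assembler is normally the default; which linker is used by default depends on the platform and on how clang was built.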
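
The Clang flow described above can be sketched the same way. This is a hedged illustration: the file name hello.c and the function name are hypothetical, while -emit-llvm and -fuse-ld= are standard clang options and llc is the LLVM static compiler. A real build may need extra flags depending on platform defaults, so treat the commands as a starting point rather than a recipe.

```python
from pathlib import Path

def clang_stage_commands(source, linker=None):
    """Return the commands that take a C file through LLVM IR to an executable.

    linker is an optional linker flavour for -fuse-ld, for example "lld" or "gold".
    """
    stem = str(Path(source).with_suffix(""))
    link = ["clang", f"{stem}.o", "-o", stem]
    if linker is not None:
        link.insert(1, f"-fuse-ld={linker}")  # ask the clang driver for a specific linker
    return [
        ["clang", "-S", "-emit-llvm", source, "-o", f"{stem}.ll"],  # C -> textual LLVM IR
        ["llc", f"{stem}.ll", "-o", f"{stem}.s"],                   # LLVM IR -> assembly
        ["clang", "-c", f"{stem}.s", "-o", f"{stem}.o"],            # assembly -> object file
        link,                                                       # object file -> executable
    ]
```

For example, clang_stage_commands("hello.c", linker="lld") builds the same pipeline but requests lld for the final step, which is one way to switch between the LLVM and GNU linkers without touching the earlier stages.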

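If you are curious about the Clang workflow in more detail, including what each step does and which option corresponds to which step of the compilation process, see the document "An Overview of Clang"; I will not write a separate post about it.

This approach makes it very convenient to support different platforms. When a new platform appears, its instructions only need to be mapped to LLVM IR; developers do not have to write a whole new optimizer and code generator that goes from source code to assembly, which saves time and effort.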

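Why is clang a front end? Isn't it a complete compiler? If clang is a complete compiler, why is it called a front end? If it is not complete, what is the back end?

Clang is a complete compiler and also a front end. It is the front end of the process that turns source code into an executable program, not the front end of a compiler. The front end of a compiler consists of parts such as the preprocessor, the lexical analyzer (lexer), and the syntax analyzer (parser).

The back end that corresponds to clang is the set of tools, such as the assembler and the linker, that ship with LLVM or come from GNU and other projects. These tools assemble the assembly code and link it into the final executable.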

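What exactly is the definition of a compiler? The definition in textbooks seems to differ from what gcc actually is

The definition of a compiler was explained in detail above. Nowadays "compiler" usually refers to the program that converts files such as .c files into .s files.

In practice, a compiler such as gcc bundles some tools (such as the preprocessor) and also calls other tools (the assembler and the linker), so it differs from that definition.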

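What is the LLVM project for?

As mentioned earlier, many compilers need several intermediate representations (IRs). Some are produced by the lexical analyzer and some by the semantic analyzer, so they are far from uniform, and as their number grows, updating instructions and optimizing performance becomes very difficult.

LLVM originally stood for "Low-Level Virtual Machine"; it is an architecture and an implementation of an intermediate representation. The LLVM project started as a set of tools built around LLVM code. A piece of C code and the corresponding LLVM code look like this (from Chris Lattner's "Architecture for a Next-Generation GCC"):
[Figure: C code and the corresponding LLVM code]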

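LLVM code has three uses:

  1. the compiler's intermediate representation;
  2. bitcode stored on disk;
  3. a human-readable assembly language representation.

These three forms are equivalent: they can either be used interchangeably or be converted with readily available tools. This makes it easy for LLVM to support new machines, optimize performance, develop new languages, and even disassemble code.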
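
The conversion between the on-disk bitcode form and the human-readable form can be sketched with the LLVM tools llvm-as and llvm-dis, plus clang's -emit-llvm option. The file stem "hello" and the function name are hypothetical; the snippet only builds the command lines and assumes the LLVM tools are installed if you choose to run them.

```python
def ir_form_commands(stem):
    """Commands that move a module between the forms of LLVM code."""
    return {
        # C source -> bitcode on disk (.bc)
        "c_to_bitcode": ["clang", "-c", "-emit-llvm", f"{stem}.c", "-o", f"{stem}.bc"],
        # bitcode -> human-readable assembly form (.ll)
        "bitcode_to_text": ["llvm-dis", f"{stem}.bc", "-o", f"{stem}.ll"],
        # human-readable form -> bitcode again
        "text_to_bitcode": ["llvm-as", f"{stem}.ll", "-o", f"{stem}.bc"],
    }
```

Each command list can be passed to subprocess.run. Round-tripping a module through llvm-dis and llvm-as is a quick way to convince yourself that the two on-disk forms carry the same information.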

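The core of the whole project is LLVM IR. LLVM IR aims to be a kind of "universal IR": low-level enough that high-level code can be mapped onto it cleanly (much as the instructions used by processors act as a "universal IR" onto which many different languages can be mapped). This has brought compilers that use LLVM IR a substantial performance improvement.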

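For a more detailed introduction to LLVM's design, see the document "LLVM Language Reference Manual".

For the performance gains that LLVM brings, see this Intel article: "Intel® C/C++ Compilers Complete Adoption of LLVM".

[Figure: LLVM benchmark chart from the Intel article]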

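What is the difference between gcc and clang?

Early on, LLVM had a project called llvm-gcc. Its biggest difference from GNU GCC is that llvm-gcc uses LLVM as the lowest-level intermediate representation at the end of the compiler, whereas GNU GCC uses RTL (Register Transfer Language). The last part of the llvm-gcc compiler therefore processes LLVM IR rather than RTL.

In other respects llvm-gcc behaves like gcc: it outputs an assembly file and works the same way. The -emit-llvm option, however, makes llvm-gcc output LLVM bitcode.

Later, while at Apple, LLVM's founder Chris Lattner started a compiler for the C language family whose intermediate representation is LLVM throughout: Clang.

Clang has since superseded llvm-gcc: although llvm-gcc can still be found, it trails clang in both usage and performance. Thanks to LLVM IR, disassembly is also convenient with Clang.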

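Below is the part of Chris Lattner's resume that mentions the birth of Clang (https://www.nondot.org/sabre/Resume.html#Apple):

[Figure: screenshot of Chris Lattner's resume]

The text in the screenshot is too small, so here is a machine translation:

[Figure: machine translation of the resume excerpt]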

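To sum up, the difference between gcc and clang is that the intermediate layers of clang are all LLVM IR, while those of gcc are RTL and other gcc-specific representations.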

Note that llvm-gcc is not the same thing as the LLVM compiler that LLVM founder Chris Lattner describes in "Architecture for a Next-Generation GCC", and that LLVM compiler is not the same thing as the later Clang either. The schematic diagram of the LLVM compiler in the paper is as follows:

[Figure: schematic diagram of the LLVM compiler from the paper]

The difference is that a linking layer is added in the middle, so linking happens twice within the whole compiler. Judging by Intel's data, however, the performance and results of the LLVM compiler are similar to those of GNU GCC. It can still be downloaded from GitHub; the latest version at the time of writing is 16: https://github.com/llvm/llvm-project/releases/tag/llvmorg-16.0.0

You can download it bundled with clang:
[Figure: screenshot]

It can also be downloaded separately:
[Figure: screenshot]

Writing this post gave me a deeper understanding of how to use gcc and clang and of compilers in general. However, since the article is long, it inevitably contains oversights. If you find errors (mistakes, typos, things I forgot to delete, and so on) while reading, please leave a comment and let me know~

Hope to help those in need~

Origin blog.csdn.net/qq_33919450/article/details/130911617