Linking Process (4/13)

When a C project is compiled, the compiler treats each C source file as a unit and translates it into a corresponding object file. Each object file is composed of a code segment, a data segment, a BSS segment, a symbol table and other sections, arranged sequentially from offset zero of the file. The offset of a symbol within its section, measured from that zero address, serves as the symbol's address. In this way, every variable and function name defined in the program gets a temporary address.

In the subsequent linking process, the sections of these object files are taken apart and assembled again. The starting address of each section changes, so the addresses of the symbols defined in each section, such as functions and global variables, change as well and must be corrected. This correction is called relocation.

Linking is mainly divided into 3 processes: segment assembly, symbol resolution and relocation.

Segment assembly

The first step in the linking process is to assemble the individual object files in segments. The linker reassembles each relocatable object file generated by the compiler: puts the code segments of each object file together as the code segment of the final executable file; puts the data segments of each object file together, as the data segment of the executable. Other sections will be assembled in the same way.

The linker also creates a global symbol table in the executable file, collecting the symbols from each object file's symbol table into it. After this step, every symbol in the executable file has an entry in the global symbol table, but the addresses recorded there are still the original ones from each object file, that is, offsets from the zero address.

How are the different pieces of code ordered during linking? The executable file produced by the linker is eventually loaded into memory for execution, so where in memory should it be loaded? Generally speaking, that address is the link start address. An executable program must also have an entry address, and the code to be executed first is generally placed first. The link address of the program and the assembly order of each segment can be specified through a linker script.

A linker script is essentially a script file. It specifies not only the assembly order, starting address and alignment of each segment, but also describes in detail the output executable file format, the target platform and the entry address. The linker assembles the executable file according to the rules defined in the linker script and finally records this information in the ELF header and section information of the executable file.

Suppose that in an embedded system the RAM starts at address 0x60000000. When linking the program, you can specify any legal address in that memory as the link start address in the linker script. When the program is run, the loader first parses the ELF header of the executable file, verifies the target platform and load address of the program, and then loads the executable file to the corresponding address in memory, where it can run. In most cases the default linker script provided by the toolchain is used.

View the default linker script with the command:

jiaming@jiaming-pc:~/Documents/CSDN_Project$ arm-linux-gnueabi-ld --verbose
GNU ld (GNU Binutils for Ubuntu) 2.34
  Supported emulations:
   armelf_linux_eabi
   armelfb_linux_eabi
using internal linker script:
==================================================
/* Script for -z combreloc */
/* Copyright (C) 2014-2020 Free Software Foundation, Inc.
   Copying and distribution of this script, with or without modification,
   are permitted in any medium without royalty provided the copyright
   notice and this notice are preserved.  */
OUTPUT_FORMAT("elf32-littlearm", "elf32-bigarm",
	      "elf32-littlearm")
OUTPUT_ARCH(arm)
ENTRY(_start)
SEARCH_DIR("=/usr/local/lib/arm-linux-gnueabi"); SEARCH_DIR("=/lib/arm-linux-gnueabi"); SEARCH_DIR("=/usr/lib/arm-linux-gnueabi"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib"); SEARCH_DIR("=/usr/arm-linux-gnueabi/lib");
SECTIONS
{
    
    
  /* Read-only sections, merged into text segment: */
  PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x00010000)); . = SEGMENT_START("text-segment", 0x00010000) + SIZEOF_HEADERS;
  .interp         : {
    
     *(.interp) }
  .note.gnu.build-id  : {
    
     *(.note.gnu.build-id) }
  .hash           : {
    
     *(.hash) }
  .gnu.hash       : {
    
     *(.gnu.hash) }
  .dynsym         : {
    
     *(.dynsym) }
  .dynstr         : {
    
     *(.dynstr) }
  .gnu.version    : {
    
     *(.gnu.version) }
  .gnu.version_d  : {
    
     *(.gnu.version_d) }
  .gnu.version_r  : {
    
     *(.gnu.version_r) }
  .rel.dyn        :
    {
    
    
      *(.rel.init)
      *(.rel.text .rel.text.* .rel.gnu.linkonce.t.*)
      *(.rel.fini)
      *(.rel.rodata .rel.rodata.* .rel.gnu.linkonce.r.*)
      *(.rel.data.rel.ro .rel.data.rel.ro.* .rel.gnu.linkonce.d.rel.ro.*)
      *(.rel.data .rel.data.* .rel.gnu.linkonce.d.*)
      *(.rel.tdata .rel.tdata.* .rel.gnu.linkonce.td.*)
      *(.rel.tbss .rel.tbss.* .rel.gnu.linkonce.tb.*)
      *(.rel.ctors)
      *(.rel.dtors)
      *(.rel.got)
      *(.rel.bss .rel.bss.* .rel.gnu.linkonce.b.*)
      PROVIDE_HIDDEN (__rel_iplt_start = .);
      *(.rel.iplt)
      PROVIDE_HIDDEN (__rel_iplt_end = .);
    }
  .rela.dyn       :
    {
    
    
      *(.rela.init)
      *(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
      *(.rela.fini)
      *(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
      *(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
      *(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
      *(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
      *(.rela.ctors)
      *(.rela.dtors)
      *(.rela.got)
      *(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
      PROVIDE_HIDDEN (__rela_iplt_start = .);
      *(.rela.iplt)
      PROVIDE_HIDDEN (__rela_iplt_end = .);
    }
  .rel.plt        :
    {
    
    
      *(.rel.plt)
    }
  .rela.plt       :
    {
    
    
      *(.rela.plt)
    }
  .init           :
  {
    
    
    KEEP (*(SORT_NONE(.init)))
  }
  .plt            : {
    
     *(.plt) }
  .iplt           : {
    
     *(.iplt) }
  .text           :
  {
    
    
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(SORT(.text.sorted.*))
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf.em.  */
    *(.gnu.warning)
    *(.glue_7t) *(.glue_7) *(.vfp11_veneer) *(.v4_bx)
  }
  .fini           :
  {
    
    
    KEEP (*(SORT_NONE(.fini)))
  }
  PROVIDE (__etext = .);
  PROVIDE (_etext = .);
  PROVIDE (etext = .);
  .rodata         : {
    
     *(.rodata .rodata.* .gnu.linkonce.r.*) }
  .rodata1        : {
    
     *(.rodata1) }
  .ARM.extab   : {
    
     *(.ARM.extab* .gnu.linkonce.armextab.*) }
  .ARM.exidx   :
    {
    
    
      PROVIDE_HIDDEN (__exidx_start = .);
      *(.ARM.exidx* .gnu.linkonce.armexidx.*)
      PROVIDE_HIDDEN (__exidx_end = .);
    }
  .eh_frame_hdr   : {
    
     *(.eh_frame_hdr) *(.eh_frame_entry .eh_frame_entry.*) }
  .eh_frame       : ONLY_IF_RO {
    
     KEEP (*(.eh_frame)) *(.eh_frame.*) }
  .gcc_except_table   : ONLY_IF_RO {
    
     *(.gcc_except_table .gcc_except_table.*) }
  .gnu_extab   : ONLY_IF_RO {
    
     *(.gnu_extab*) }
  /* These sections are generated by the Sun/Oracle C++ compiler.  */
  .exception_ranges   : ONLY_IF_RO {
    
     *(.exception_ranges*) }
  /* Adjust the address for the data segment.  We want to adjust up to
     the same address within the page on the next page up.  */
  . = DATA_SEGMENT_ALIGN (CONSTANT (MAXPAGESIZE), CONSTANT (COMMONPAGESIZE));
  /* Exception handling  */
  .eh_frame       : ONLY_IF_RW {
    
     KEEP (*(.eh_frame)) *(.eh_frame.*) }
  .gnu_extab      : ONLY_IF_RW {
    
     *(.gnu_extab) }
  .gcc_except_table   : ONLY_IF_RW {
    
     *(.gcc_except_table .gcc_except_table.*) }
  .exception_ranges   : ONLY_IF_RW {
    
     *(.exception_ranges*) }
  /* Thread Local Storage sections  */
  .tdata	  :
   {
    
    
     PROVIDE_HIDDEN (__tdata_start = .);
     *(.tdata .tdata.* .gnu.linkonce.td.*)
   }
  .tbss		  : {
    
     *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
  .preinit_array    :
  {
    
    
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array))
    PROVIDE_HIDDEN (__preinit_array_end = .);
  }
  .init_array    :
  {
    
    
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
    KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .ctors))
    PROVIDE_HIDDEN (__init_array_end = .);
  }
  .fini_array    :
  {
    
    
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP (*(SORT_BY_INIT_PRIORITY(.fini_array.*) SORT_BY_INIT_PRIORITY(.dtors.*)))
    KEEP (*(.fini_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .dtors))
    PROVIDE_HIDDEN (__fini_array_end = .);
  }
  .ctors          :
  {
    
    
    /* gcc uses crtbegin.o to find the start of
       the constructors, so we make sure it is
       first.  Because this is a wildcard, it
       doesn't matter if the user does not
       actually link against crtbegin.o; the
       linker won't look for a file to match a
       wildcard.  The wildcard also means that it
       doesn't matter which directory crtbegin.o
       is in.  */
    KEEP (*crtbegin.o(.ctors))
    KEEP (*crtbegin?.o(.ctors))
    /* We don't want to include the .ctor section from
       the crtend.o file until after the sorted ctors.
       The .ctor section from the crtend file contains the
       end of ctors marker and it must be last */
    KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .ctors))
    KEEP (*(SORT(.ctors.*)))
    KEEP (*(.ctors))
  }
  .dtors          :
  {
    
    
    KEEP (*crtbegin.o(.dtors))
    KEEP (*crtbegin?.o(.dtors))
    KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .dtors))
    KEEP (*(SORT(.dtors.*)))
    KEEP (*(.dtors))
  }
  .jcr            : {
    
     KEEP (*(.jcr)) }
  .data.rel.ro : {
    
     *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) }
  .dynamic        : {
    
     *(.dynamic) }
  . = DATA_SEGMENT_RELRO_END (0, .);
  .got            : {
    
     *(.got.plt) *(.igot.plt) *(.got) *(.igot) }
  .data           :
  {
    
    
    PROVIDE (__data_start = .);
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
  }
  .data1          : {
    
     *(.data1) }
  _edata = .; PROVIDE (edata = .);
  . = .;
  __bss_start = .;
  __bss_start__ = .;
  .bss            :
  {
    
    
   *(.dynbss)
   *(.bss .bss.* .gnu.linkonce.b.*)
   *(COMMON)
   /* Align here to ensure that the .bss section occupies space up to
      _end.  Align after .bss to ensure correct alignment even if the
      .bss section disappears because there are no input sections.
      FIXME: Why do we need it? When there is no .bss section, we do not
      pad the .data section.  */
   . = ALIGN(. != 0 ? 32 / 8 : 1);
  }
  _bss_end__ = .; __bss_end__ = .;
  . = ALIGN(32 / 8);
  . = SEGMENT_START("ldata-segment", .);
  . = ALIGN(32 / 8);
  __end__ = .;
  _end = .; PROVIDE (end = .);
  . = DATA_SEGMENT_END (.);
  /* Stabs debugging sections.  */
  .stab          0 : {
    
     *(.stab) }
  .stabstr       0 : {
    
     *(.stabstr) }
  .stab.excl     0 : {
    
     *(.stab.excl) }
  .stab.exclstr  0 : {
    
     *(.stab.exclstr) }
  .stab.index    0 : {
    
     *(.stab.index) }
  .stab.indexstr 0 : {
    
     *(.stab.indexstr) }
  .comment       0 : {
    
     *(.comment) }
  .gnu.build.attributes : {
    
     *(.gnu.build.attributes .gnu.build.attributes.*) }
  /* DWARF debug sections.
     Symbols in the DWARF debugging sections are relative to the beginning
     of the section so we begin them at 0.  */
  /* DWARF 1 */
  .debug          0 : {
    
     *(.debug) }
  .line           0 : {
    
     *(.line) }
  /* GNU DWARF 1 extensions */
  .debug_srcinfo  0 : {
    
     *(.debug_srcinfo) }
  .debug_sfnames  0 : {
    
     *(.debug_sfnames) }
  /* DWARF 1.1 and DWARF 2 */
  .debug_aranges  0 : {
    
     *(.debug_aranges) }
  .debug_pubnames 0 : {
    
     *(.debug_pubnames) }
  /* DWARF 2 */
  .debug_info     0 : {
    
     *(.debug_info .gnu.linkonce.wi.*) }
  .debug_abbrev   0 : {
    
     *(.debug_abbrev) }
  .debug_line     0 : {
    
     *(.debug_line .debug_line.* .debug_line_end) }
  .debug_frame    0 : {
    
     *(.debug_frame) }
  .debug_str      0 : {
    
     *(.debug_str) }
  .debug_loc      0 : {
    
     *(.debug_loc) }
  .debug_macinfo  0 : {
    
     *(.debug_macinfo) }
  /* SGI/MIPS DWARF 2 extensions */
  .debug_weaknames 0 : {
    
     *(.debug_weaknames) }
  .debug_funcnames 0 : {
    
     *(.debug_funcnames) }
  .debug_typenames 0 : {
    
     *(.debug_typenames) }
  .debug_varnames  0 : {
    
     *(.debug_varnames) }
  /* DWARF 3 */
  .debug_pubtypes 0 : {
    
     *(.debug_pubtypes) }
  .debug_ranges   0 : {
    
     *(.debug_ranges) }
  /* DWARF Extension.  */
  .debug_macro    0 : {
    
     *(.debug_macro) }
  .debug_addr     0 : {
    
     *(.debug_addr) }
  .gnu.attributes 0 : {
    
     KEEP (*(.gnu.attributes)) }
  .note.gnu.arm.ident 0 : {
    
     KEEP (*(.note.gnu.arm.ident)) }
  /DISCARD/ : {
    
     *(.note.GNU-stack) *(.gnu_debuglink) *(.gnu.lto_*) }
}


==================================================

When compiling programs for an embedded bare-metal environment, especially low-level ARM code, we often need to specify the link address flexibly according to the hardware configuration, memory size and memory addresses of the development board, explicitly specify a linker script, and sometimes even write the linker script ourselves.

The linker script u-boot.lds generated when building the U-Boot source code is generally placed in the top-level directory of the U-Boot source tree. The linker script vmlinux.lds generated when building the Linux kernel is generally placed under the arch/arm/boot/compressed/ directory. In an IDE, you can set the start addresses of the code segment and data segment directly through the Debug Setting interface, and set the entry address of the program through the layout option of the linker.

Symbol resolution

A company's project is usually jointly developed by a software team composed of multiple people. In a project, the functional requirements are generally defined by the product manager, and the system analysis and module division are performed by the architect, and then the specific implementation of each module is assigned to different personnel. Developers may have a problem in the programming of their respective modules: global variables and functions located in different modules or different files may have duplicate name conflicts.

When a global variable or function with the same name is defined in multiple files, the linker discovers the duplicate definitions during linking and a symbol conflict occurs: which definition should end up in the executable file? The linker has dedicated symbol resolution rules for resolving such conflicts:

Strong symbols and weak symbols: function names and initialized global variables are strong symbols, while uninitialized global variables are weak symbols. A strong symbol may not be defined more than once. Strong and weak symbols with the same name can coexist in a project; in that case the strong symbol overrides the weak ones, and the linker chooses the strong symbol as the final symbol in the executable file.

The linker also allows multiple weak symbols to coexist in a project. During compilation, when the compiler meets an uninitialized global variable in a file, it does not know whether that symbol will be kept or discarded at link time. It therefore does not place uninitialized global variables directly in the BSS segment: these weak symbols are put in a temporary block called COMMON, marked as undefined COMMON symbols in the symbol table, and no storage space is allocated for them in the object file. (This is the traditional behaviour; GCC 10 and later default to -fno-common, which places such variables in the BSS segment directly and turns duplicate definitions into link errors.)

During linking, the linker compares the weak symbols with the same name across files and chooses the one that occupies the most space as the final symbol in the executable file. At that point the size of the weak symbol is known, and it is placed directly in the BSS section of the executable file.

Under normal circumstances, initialized global variables and function names are strong symbols by default, and uninitialized global variables are weak symbols by default. If the project has special requirements, a strong symbol can also be explicitly declared as a weak symbol.

__attribute__((weak)) int n = 100;

Corresponding to strong and weak symbols, there are also strong references and weak references. In a program we can define many functions and variables, and their names are symbols. The essence of a symbol, that is, its value, is an address. In another file, we can call the function through its name and access the variable through its name. Calling a function or accessing a variable through its symbol is usually called a reference. Strong symbols correspond to strong references, and weak symbols correspond to weak references.

During linking, if a reference to a symbol is a strong reference and its definition cannot be found, the linker reports an undefined-symbol error. If the reference is a weak reference, the linker does not report an error when no definition is found, and the final executable file is still generated. The error only shows up at run time, if the program uses the symbol even though no definition was found.

Using the linker's handling of weak references, a program can check whether a symbol has a definition before using it. The advantage is that referring to an undefined symbol causes no error at link time, and run-time errors can also be avoided by checking first and only then calling.

In the process of module implementation, we can declare a series of API functions provided to users as weak symbols:

  • When we are not satisfied with the library's implementation of some API functions, or those functions have bugs and a better implementation exists, we can define our own functions with the same names as the library functions and call them directly without a conflict.
  • When implementing a library, unfinished APIs in extension modules can be declared as weak references. Before calling such an API, the application first checks whether the function is implemented and only then calls it. The advantage is that when a new version of the library is released later, the application still links and runs normally whether these interfaces have been implemented or removed.

Relocation

Symbol resolution solves the problem of symbol conflicts between files during linking. Afterwards, every symbol in the symbol table of the executable file has been decided, but one problem remains: the value of each symbol, that is, the address of each function or global variable, is still the original one from its object file, an offset from the zero address. After the linker reassembles the object files, the starting address of each segment has changed, and the addresses of the symbols in each segment change with it. The next step is therefore to correct the symbol addresses in the global symbol table of the executable file, writing their real addresses into the table. Once this is done, calling a function or accessing a variable through a symbol reference finds its real address in memory.

Through the relocation table of the file, the linker can know which symbols need to be relocated. The core work of relocation is to correct the symbol address in the instruction. It is the last step in the linking process, and it is also the core and most important step. The operations of the first two steps are actually for this step.

During the compilation phase, while generating an object file from each C source file, the compiler generally does not report an error when it meets a symbol that is used but not defined in that file; it assumes the symbol may be defined elsewhere. In the link phase, if the linker cannot find a definition of the symbol anywhere, it reports a link error. During compilation the compiler collects these undefined symbols into a relocation table, which tells the linker that the symbols are referenced in this file but not defined in it, and asks the linker to resolve them.

The relocation table holds one important piece of information: the offset, within the instruction code, of each symbol reference that needs relocating. When the linker corrects the symbol references in the instruction code, this offset is the only way it can locate them in the mass of binary code. The linker reads the relocation table of each object file, relocates the symbols according to their new addresses in the executable file, patches the addresses in the instructions that reference them, and generates a new symbol table.

The entire compilation process is over, and we get an executable object file.

Origin blog.csdn.net/weixin_39541632/article/details/132031475