Chapter 4 C Language

In the previous chapters, the boot record mbr.bin was written in x86 assembly language; the BIOS loads it at memory address 0x07c00 and it runs successfully. A minimal version of myos, start.bin, was then also written in x86 assembly language; mbr.bin loads it at memory address 0x10000 and it runs successfully. From here on, we only need to add the basic functions of an operating system on top of start.bin. Writing all of myos in assembly language is not impossible, but it would be a challenge for anyone. The preferred high-level language for writing an operating system is, of course, C. This chapter explains the basic issues of C programming and of mixed programming in x86 assembly and C.

4.1 Introduction to gcc

Richard Stallman is a champion of free software. In the 1980s, the culture of freely shared software was collapsing under the pressure of the commercialization of the software industry. Stallman, then at the MIT AI Laboratory, resigned from MIT to work on free software and published the famous GNU Manifesto in 1985, which called for developing a completely free, Unix-compatible operating system named GNU (GNU's Not Unix). He later founded the Free Software Foundation (FSF), and in 1989 he drafted the GNU General Public License (GPL).
Although the operating system kernel of the GNU project progressed slowly and was in practice replaced by the Linux kernel, GNU has produced a number of very effective tools, most notably the editor Emacs and the compiler gcc.
The original gcc was a portable, optimizing, open source C compiler that supported ANSI C. After decades of development, gcc has grown from the original GNU C Compiler into the GNU Compiler Collection.
gcc is a highly versatile compiler that can build executable programs for a variety of hardware platforms. It supports more than 40 architectures, the common ones being x86, Arm, PowerPC, MIPS, and RISC-V, and the code it generates is said to run on average 20% to 30% more efficiently than code from typical compilers.
In addition to C, gcc supports many other languages, including C++, Ada, and Fortran (older releases also shipped a Java front end). This chapter, of course, only covers compiling C programs.
gcc can also generate various types of output files on demand, including intermediate files, assembly files, object files, and executable files.
The basic command format of gcc is:
gcc [options] [filenames]
Here, options are the option parameters recognized by gcc, and filenames gives the relevant file names. gcc is rich in functions and has more than 100 option parameters. Only some of the most commonly used ones are listed here:
1. gcc without parameters
Compilation with gcc goes through four interrelated steps: preprocessing, compilation, assembly, and linking. Running gcc without options performs all four steps, that is, gcc directly preprocesses, compiles, assembles, and links the C source program into an executable file. The executable file name defaults to a.out (under Linux) or a.exe (under Windows). For example:
gcc writeA.c
compiles and links writeA.c directly into an executable file, which is named a.exe under Windows cmd.
gcc can also compile and link multiple C source files into one executable file. In that case, exactly one main function must exist across these source files, because by default the linked program starts executing at main. For example, if m.c contains a main function that calls add, and add is defined in n.c, the following command compiles and links m.c and n.c into an executable file:
gcc m.c n.c
The output executable file is named a.exe by default.
2. -E parameter
The -E parameter tells gcc to only preprocess the C source program. In the preprocessing stage, gcc calls cpp.exe to expand the contents of each header file at the position where it is included and to expand macro definitions. This stage produces an intermediate text file with the extension .i. For example:
gcc -E writeA.c
will output the preprocessed content on the screen, while:
gcc -E writeA.c -o writeA.i
will save the preprocessing result in the intermediate file writeA.i. The preprocessing result can be saved in a file of any name, but gcc only recognizes files with the .i extension as preprocessed intermediate files. For example:
gcc -E writeA.c -o abc
Here, although the preprocessing result of writeA.c is saved in the file abc, gcc will not treat abc as a preprocessed intermediate file. gcc recognizes it as such only when the file name carries the .i extension.
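To make the effect of preprocessing concrete, here is a small C sketch. The macro and function names (SECTOR_SIZE, SECTORS_TO_BYTES, bytes_for_sectors) are made up for illustration and do not come from myos. After gcc -E, the macro names no longer appear in the function body, and the return statement would read roughly return ((n) * 512);.

```c
/* Hypothetical macros, used only to illustrate what the preprocessor does. */
#define SECTOR_SIZE 512
#define SECTORS_TO_BYTES(n) ((n) * SECTOR_SIZE)

/* In the preprocessed output, the macro call below is replaced by its expansion. */
int bytes_for_sectors(int n)
{
    return SECTORS_TO_BYTES(n);
}
```

Saving this as a .c file and running gcc -E on it is an easy way to compare the source with the preprocessed text.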
3. -S parameter
The -S parameter tells gcc to stop after compilation. In the compilation stage, gcc invokes cc1.exe to compile the input file into an assembly language text file with the extension .s. For example:
gcc -S writeA.c -o writeA.s
Since the input file writeA.c is a C source program, gcc first preprocesses it, then compiles it, and saves the result in the assembly file writeA.s. You can also write:
gcc -S writeA.i -o writeA.s
Since the input file writeA.i is already a preprocessed intermediate file, gcc skips preprocessing, only compiles, and saves the result in the assembly file writeA.s. Although abc in the earlier example is also a preprocessed intermediate file, the command:
gcc -S abc -o writeA.s
will not work. Because the extension is not .i, gcc does not compile abc as the result of preprocessing.
4. -c parameter
The -c parameter tells gcc to stop after assembly, without linking. In the assembly stage, gcc calls as.exe to assemble the input file into a binary object file with the extension .o. For example:
gcc -c writeA.s -o writeA.o
means that gcc assembles the already preprocessed and compiled assembly file writeA.s into the object file writeA.o. It is also possible to write:
gcc -c writeA.i -o writeA.o
which means that gcc first compiles the preprocessed intermediate file and then assembles it into the object file writeA.o. Or:
gcc -c writeA.c -o writeA.o
which means that gcc first preprocesses the C source program, then compiles it, and finally assembles it into the object file writeA.o.
One more point needs explaining: gcc on Linux produces object files in ELF format, but MinGW is a gcc platform for Windows, so the object files compiled by gcc under MinGW are in PE format instead of ELF.
5. -o parameter
The -o parameter specifies the output file name.
6. -g parameter
To debug at the source level with gdb, the GNU debugger, you must add this parameter when compiling.
7. -O parameter
The -O parameter compiles the program with optimization. With this option the code is optimized during compilation, which can improve the execution efficiency of the generated executable file, but compilation becomes correspondingly slower. Optimization also affects debugging, because the executed instructions may no longer correspond line by line to the source code.
8. -O2 parameter
The -O2 parameter optimizes more aggressively than -O, but the whole compilation process is slower.
9. -I parameter
The -I parameter specifies a directory in which header files are located and is used during preprocessing. For header files enclosed in angle brackets (<>), the preprocessor cpp searches the directories given by -I and then the system default header file directories. For header files enclosed in double quotes (""), cpp first searches the directory of the source file that contains the #include, and if the file is not found there, it continues with the directories given by -I and the system default directories.
10. -L parameter
The -L parameter specifies a directory in which library files are located and is used during linking. The linker ld searches for the required library files in the system default directories. If a library file is not in a default directory, use the -L parameter to specify the directory to search.
11. -l parameter
The -l parameter specifies the name of a function library required for linking. The library must be located in a system default directory or in a directory specified with the -L option. By convention, library file names start with the string "lib", so the three letters lib are omitted when naming the library with the -l option. For example, -lm links the math library, whose static version is named libm.a.
12. -v parameter
The -v parameter prints the detailed steps of gcc's work, including the commands it runs at each stage.
13. -Werror parameter
The -Werror parameter tells gcc to treat all warnings as errors, so compilation fails wherever a warning is generated, forcing programmers to fix their code.

4.2 Basics of C language

We will not cover the basics of the C language here. Instead, we only discuss C header files and library files, in the hope of helping you sort out the structure of C programs and lay a solid foundation for writing or reading C programs in the future.
A C program is a combination of multiple functions, the most important of which is main. The main function is the entry point of program execution: no matter how complicated a C program is, everything starts from main.
The main function can call other functions, and those functions can in turn call other functions.
Except for main, every C function goes through three stages when it is used: declaration, definition, and call.
A function declaration lets the compiler check that the argument types used in a call are consistent with the function. (1) If a function is defined before main in a C source file, it does not need a separate declaration, because the definition also serves as the declaration. (2) If a function is defined after main (which makes the program structure clearer), it must be declared before main. (3) More often, a function is defined in another C source file, in which case it must be declared in the source file that calls it. (4) Since all C library functions are packaged in library files, they must of course be declared in the source file that calls them.
Header files are used to declare functions.
A function definition describes what the function does. In a C source file, a function can be defined before or after main, but either way, written like this, the function is in practice only called within that file. If you want a function to be called freely from several different C source files, define it in a separate C source file and include that file when compiling and linking. Alternatively, compile the C source files that each define a function into object files, archive the object files in a library file, and then supply just that one library file when linking.
Library files are used to define functions.
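The relationship between declaration, definition, and call can be sketched in a single file. The function names add and add_three are illustrative assumptions. In a real project, the declaration would typically live in a header file and the definition of add in a separate source file or library.

```c
/* Declaration: tells the compiler the parameter and return types of add. */
int add(int a, int b);

/* Call: add is used here before its definition, which the declaration permits. */
int add_three(int a, int b, int c)
{
    return add(add(a, b), c);
}

/* Definition: describes what add actually does. */
int add(int a, int b)
{
    return a + b;
}
```

If the declaration on the first line is removed, the compiler no longer knows the types of add at the point of the call and can only guess or report a diagnostic.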
The famous Hello World is the first C source program most people meet, and its main function only calls printf. Does printf need to be declared? It does, and the declaration is in the header file stdio.h, which is why the Hello World program must include stdio.h. Does printf need to be defined? It does, and the definition is in printf.o, which is archived in the library file libc.a. On Linux, the default header file path of gcc is /usr/include, the default library file path is /usr/lib, and the default C function library is libc.a (static library) or libc.so (dynamic library). This textbook uses gcc on the MinGW platform under Windows. Following the installation path in Chapter 1, if MinGW is installed on drive D, the default header file path is D:\MinGw\include and the default library file path is D:\MinGw\lib. Non-default header file and library file paths are given to gcc with the "-I" and "-L" parameters, and non-default library files are given with the "-l" parameter.
With this foundation, when you read any C source program you can locate the declarations and definitions of all its functions, and so read and understand the program. Conversely, you can organize your own header files and library files so that your C programs read more clearly, are more convenient to call, and look more professional.

4.3 Introduction to GNU Binutils

To avoid adding to the burden, we will not introduce GNU Binutils in detail here and only learn the few commands used in this chapter, together with the related nasm, gcc, and writea commands.

nasm mbr.asm -o mbr.bin: assembles mbr.asm into mbr.bin with the NASM assembler.
nasm -f elf start.asm -o start.o: assembles start.asm into the ELF object file start.o with the NASM assembler.
gcc -c myos.c -o myos.o: compiles myos.c into the object file myos.o with the GCC compiler.
ld -s --entry=start -Ttext=0x0 start.o myos.o -o myos.exe: links start.o and myos.o into myos.exe with the linker. --entry=start specifies that the entry point of the program is start, and -Ttext=0x0 specifies that the address of the text section is 0x0.
objcopy -O binary myos.exe myos.com: converts myos.exe into the pure binary file myos.com with the objcopy tool.
writea mbr.bin A.img 1: writes mbr.bin to the first sector of A.img.

4.4 Mixed programming of 8086 assembly and C language

As mentioned earlier, developing myos requires mixed programming in assembly language and C. Assembly language is mainly used for the low-level startup code, interrupt handlers, and input/output routines, while C is used for the kernel code and application programs of myos.
Before formally writing myos, this section introduces mixed programming with C. The basic idea is to jump from start.asm of Chapter 3 into a function written in C and continue execution there. To review: when the system is powered on, the BIOS runs first; after the BIOS loads the MBR, the MBR runs; and after the MBR loads start.bin, start.bin runs. In Chapter 3, start.bin had nothing more to do after displaying "my os is running!" on the screen and simply entered an endless loop. Now we write a function in C and jump to it from start.bin to continue execution.

4.4.1 Several problems of mixed programming

1. 16-bit code
By default, nasm assembles a program into 32-bit code when it produces a formatted object file. Since myos runs in real-address mode, its code must be 16-bit, so nasm has to be told to assemble the program into 16-bit code. This is done by adding the "BITS 16" directive to the assembly language program. Similarly, GCC compiles C programs into 32-bit code by default, while myos requires C programs that can execute in 16-bit mode. This is achieved by adding the ".code16gcc" directive to the C program through an inline asm statement. See the code example for the exact format.
2. Symbol reference
To reference a symbol (variable or function) of the assembly language program from the C program, first define the symbol in the assembly language program by writing it as a label in front of a statement, with or without a trailing colon. Then declare it as a global symbol with the global keyword at the beginning of the assembly language program. Finally, declare it as an external symbol with the extern keyword in the C program, after which the C program can reference it. Note that the symbol carries a leading underscore when it is defined and declared in the assembly language program, and no underscore when it is declared and referenced in the C program.
To reference a symbol (variable or function) of the C program from the assembly language program, first define the variable or function at file scope in the C program, which makes it a global symbol. Then declare it as an external symbol with the extern keyword in the assembly language program, after which the assembly language program can reference it. Note that in this chapter's example the symbol is written with a single leading underscore when it is defined in the C program, and with a double underscore when it is declared and referenced in the assembly language program.
The description may be hard to follow in words alone; the code example in Section 4.4.2 makes it clearer.
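The underscore rule can be summarized as: the name seen by the assembler and linker is the C name with one underscore in front. The sketch below only models that naming rule in plain C. The helper asm_symbol_name is hypothetical, and the rule is assumed to be the one used by 32-bit gcc on MinGW, as in this chapter.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes the assembly-level symbol name for a C
 * identifier into out, assuming one leading underscore is prepended.
 * Returns 0 on success and -1 if the buffer is too small. */
int asm_symbol_name(const char *c_name, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "_%s", c_name);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Under this assumption, puts in C corresponds to _puts in start.asm, and _mymain in C corresponds to __mymain.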
3. Parameter passing
Parameters are passed between the caller and the callee mainly through the stack, so you need to know the order and width of the parameters on the stack. In 16-bit code this is relatively simple and is discussed in detail later, when it comes up. The return address is also passed between the caller and the callee through the stack, but here the situation is a little more complicated. When nasm assembles a program into 16-bit code, all instructions use 16-bit mode. When gcc compiles a C source program into 16-bit code, however, many instructions use 32-bit operand size, which means that many instructions are preceded by a prefix byte with the value 0x66, including the call and ret instructions. To avoid mismatched stack contents and misplaced parameters, you must either change the calling method or add the 0x66 byte manually. There are four cases:
1. If the caller is written in assembly language and the callee is written in C, replace the call instruction in the caller with a combination of a push instruction and a jmp instruction: push the return address onto the stack, then jump directly to the name of the called function.
2. If both the caller and the callee are written in assembly language, use the call instruction in the caller and the ret instruction in the callee.
3. If the caller is written in C and the callee is written in assembly language, call the symbol by its name without the underscore in the C caller, and add the pseudo-instruction "DB 0x66" before the ret instruction of the assembly callee, that is, add the 0x66 byte manually.
4. If both the caller and the callee are written in C, just call and return in the normal way.
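Why the 0x66 prefix matters can be seen with a small model of the stack. The sketch below is plain C that simulates a downward-growing stack as a byte array. The type Stack16 and the function names are assumptions made for illustration and are not part of myos. It assumes that the C caller pushes a 4-byte argument and that its prefixed call pushes a 4-byte return address, so the callee finds the low 16 bits of the argument at offset 4 from the top of the stack.

```c
#include <stdint.h>

enum { STACK_BYTES = 64 };

/* Hypothetical model of a stack that grows toward lower addresses. */
typedef struct {
    uint8_t  mem[STACK_BYTES];
    uint16_t sp;              /* index of the current top of the stack */
} Stack16;

/* Push a 32-bit value in little-endian order, as a 0x66-prefixed push or call would. */
static void push32(Stack16 *s, uint32_t value)
{
    s->sp = (uint16_t)(s->sp - 4);
    for (int i = 0; i < 4; i++)
        s->mem[s->sp + i] = (uint8_t)(value >> (8 * i));
}

/* Read a 16-bit little-endian value at the given offset from the top of the stack. */
static uint16_t read16(const Stack16 *s, uint16_t offset)
{
    return (uint16_t)(s->mem[s->sp + offset] |
                      (s->mem[s->sp + offset + 1] << 8));
}

/* Simulate a C caller passing one argument, then return what an assembly
 * callee would read at offset 4 from the top of the stack. */
uint16_t arg_seen_by_callee(uint16_t string_offset, uint32_t return_address)
{
    Stack16 s = { {0}, STACK_BYTES };
    push32(&s, string_offset);      /* argument, widened to 4 bytes */
    push32(&s, return_address);     /* 4-byte return address from the prefixed call */
    return read16(&s, 4);
}
```

In this model a plain 16-bit ret would pop only 2 of the 4 return-address bytes and leave the stack misaligned, which is the reason for placing DB 0x66 in front of RET in an assembly callee.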

4.4.2 An Example of Hybrid Programming

Next, let's write an example of mixed programming in assembly language and C. Although very simple, this example contains the essentials of mixed programming. It is also the most basic framework of myos: as we will see later, an increasingly rich myos only needs bricks and tiles added to this framework.
1. start.asm
In the myos4 subdirectory, first modify start.asm.
When mbr.asm and start.asm were assembled earlier, there was no "-f" option, that is, no object file format was specified, and by default nasm assembled them into 16-bit pure binary executable files without any format, which happened to be exactly what we needed. Now that start.asm is to be linked with a C program, its object file must have a format, because only with a formatted object file can the ld command recognize the various pieces of information in it and link them together by category. Once a format is specified, nasm assembles into 32-bit code by default. If 16-bit code is required, tell nasm with the "BITS 16" directive mentioned above.
As you can imagine, when the ld command links start.asm and a C program into an executable file, the executable file needs a program entry, that is, the place where execution starts. Whether that is in start.asm or in the C program has to be specified by you. Because our program structure calls the function of the C program from start.asm, execution naturally starts in start.asm. Therefore, the address symbol of the first instruction in start.asm must be declared global with the global keyword and specified with the "--entry" option of ld. You choose the name of the entry symbol yourself; there are no special requirements, and meaningful names such as start, entry, or begin are typical.
The start.asm program of Chapter 3 ended with two assembly statements that implement an infinite loop. Now change them to jump to __mymain. Note the double underscore: __mymain refers to a function defined in the C program, and it is declared as an external symbol with extern at the beginning of start.asm. At the end of start.asm, add one more piece of code. Its first assembly statement is labeled with the symbol "_puts" and its last two statements are DB 0x66 and RET, so this code is a subroutine named _puts. The subroutine prints the string passed as a parameter, taking the address of the string from the stack. Because the C program needs to call puts, _puts must be declared global with the global keyword in start.asm.
The modified start.asm looks like this:

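	[BITS 16]                ; assemble as 16-bit code
	GLOBAL start             ; program entry, passed to ld with --entry
	EXTERN __mymain          ; _mymain is defined in myos.c
	GLOBAL _puts             ; called from C as puts
start:
	JMP entry
entry:
	MOV AX,CS                ; DS and ES use the same segment as CS
	MOV DS,AX
	MOV ES,AX
	MOV AX,0
	MOV SS,AX                ; stack at 0x0000:0x7e00
	MOV SP,0x7e00
	LEA SI,msg
ploop:
	LODSB                    ; AL = next character of msg
	CMP AL,0
	JE fin                   ; 0 marks the end of the string
	MOV AH,0x0e              ; BIOS teletype output
	MOV BX,0x0f
	INT 0x10
	JMP ploop
fin:
	JMP __mymain             ; continue in the C function
msg:
	DB "myos is running!-start.asm",0x0a,0x0a,0x0a,0
_puts:                       ; print the string argument at the screen cursor position
	MOV BP,SP                ; note: BP and SI are overwritten without being saved
	MOV SI,[BP+0x04]         ; argument follows the 4-byte return address
loop2:
	LODSB
	CMP AL,0
	JZ loop3
	MOV AH,0x0e
	MOV BL,0x07
	INT 0x10
	JMP loop2
loop3:
	DB 0x66                  ; operand-size prefix so RET pops a 4-byte return address
	RET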

After initializing the segment registers and the stack, the program prints a string, jumps to __mymain to continue execution, and finally defines the subroutine _puts. With the previous chapters behind you, this should be understandable, so it is not explained further here.
2. myos.c
Write the C program myos.c in the myos4 subdirectory.
myos.c needs to be compiled by gcc into code that runs in 16-bit mode, so the statement __asm__(".code16gcc\n"); is required.
myos.c needs to call the function puts defined in the assembly language program start.asm, so puts must be declared external with the extern keyword. Note that no underscore is added when puts is declared and called in C.
myos.c contains only one function, named _mymain; you can give it any name you like. Because the function is jumped to from the assembly language program start.asm, an underscore is added in front of its name in C, and start.asm refers to it with a double underscore, as __mymain.
The _mymain function only calls puts and then enters an endless loop. The function used to display strings here is the custom puts of start.asm, not the C library function printf. This is because printf is a packaged library function that needs a 32-bit environment to run, so it cannot be used in the 16-bit myos.
Note the terminator of the string. In the assembly language program, both strings are printed one character at a time, and the end of a string is marked by 0, so when defining a string you must remember to add the terminator 0 (a byte whose value is 0). The string here is defined in C; if you view myos.o with a hexadecimal editor, you will see that gcc automatically adds a terminator 0 to the end of the string.
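The automatic terminator can also be checked from C itself. In this sketch the array greeting and the function count_chars are illustrative names. count_chars walks the string the same way the LODSB loop in start.asm does, stopping at the byte whose value is 0.

```c
#include <stddef.h>

/* Five visible characters; the compiler appends a sixth byte with value 0. */
const char greeting[] = "Hello";

/* Count characters up to, but not including, the 0 terminator. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Here sizeof greeting is one larger than the number of visible characters, which accounts for the terminator byte.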
myos.c is simple, but the framework is important, and its content is shown below.

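__asm__(".code16gcc\n");                /* make the assembler emit code for 16-bit mode */
extern int puts(char *str);             /* defined in start.asm as _puts */
char * str = "Hello world from C Language! -myos.c";

/* Entered by a jump from start.asm, which refers to this function as __mymain. */
int _mymain()
{
	puts(str);                          /* print through the BIOS routine in start.asm */
	for(;;);                            /* never return: there is no caller to return to */
}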

3. Assemble, compile, and link
First assemble start.asm. nasm can output many formats. Since gcc on the MinGW platform can only output the PE format, I originally wanted to assemble start.asm into PE format as well, but in practice the PE format does not support 16-bit code, so start.asm has to be assembled into ELF format. Fortunately, the ld command is powerful and can link object files of different formats. Although we have sunk very close to the bottom, we remain at the mercy of other people's tools unless we develop a compiler ourselves.
The command to assemble start.asm is:
nasm -f elf start.asm -o start.o
[Figure: assembling start.asm and checking the format of start.o with objdump]
The objdump command in the figure only checks the format of the start.o file and is not a required step.
Then compile myos.c. myos.c can only be compiled into an object file, with the "-c" option:
gcc -c myos.c -o myos.o
[Figure: compiling myos.c]
Finally, link them into one executable file:
ld -s --entry=start -Ttext=0x0 start.o myos.o -o myos.exe
[Figure: linking start.o and myos.o]
Here, the "-s" option strips all symbol information from the output file, which effectively reduces the size of the executable file. The "--entry=start" option specifies that the entry point of the program is the start symbol in start.asm. The "-Ttext=0x0" option tells the linker that the absolute address of the text section is 0. The "-o myos.exe" option specifies that the output executable file is named myos.exe.
4. Dump and load
The executable file produced by linking is named myos.exe. Its format is pei-i386 and its length is about 4KB. The various pieces of information in the pei-i386 format are intended for the loader of an operating system. myos does not want anything that complicated, so use the objcopy tool to copy myos.exe into myos.com, an executable file in pure binary format that is convenient to load and execute. The command is:
objcopy -O binary myos.exe myos.com
Here the -O option and its argument request that the output file be written in pure binary format. myos.exe is the input file; the -I option can be used to state its format, but you can also omit the option and let objcopy determine the input format itself. myos.com is the output file.
Chapter 3 treated start.bin as the myos operating system. Because it was only 57 bytes long, mbr.asm read just 1 sector when loading it. Now myos.com is the myos operating system, and its length is about 17KB, so mbr.asm must be modified to change the number of sectors read to 40.
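The sector count follows from a simple rounding-up division. The sketch assumes 512-byte sectors for the disk image, and the function name sectors_needed is made up for illustration. By this calculation a file of 17KB needs 34 sectors, so reading 40 sectors leaves some headroom.

```c
/* Assumed sector size of the disk image, in bytes. */
#define SECTOR_BYTES 512UL

/* Number of whole sectors needed to hold file_bytes bytes (rounds up). */
unsigned long sectors_needed(unsigned long file_bytes)
{
    return (file_bytes + SECTOR_BYTES - 1) / SECTOR_BYTES;
}
```

The same function gives 1 sector for the 57-byte start.bin of Chapter 3.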
In fact, myos.com is still only a simple framework, and its actual code and data are far smaller than 17KB. So why is the file so big after objcopy? The main reason is the way ld assigns logical addresses to the sections, as shown in the figure below.
[Figure: section addresses assigned by ld in myos.exe]
When myos.exe is loaded, the actual address of each section is relocated by the operating system. myos.com, in contrast, is a pure binary file that is loaded without relocation, so objcopy has to place every section at its actual address and fill the gaps between them, which makes the file large. You can compare myos.exe and myos.com in a hexadecimal editor. As myos is enriched in the future, code and data will occupy the positions now filled with 0, and the actual size of the file will not change much.
Use the writeA.exe tool to write myos.com into A.img, starting at the second sector.
The final running result is shown in the figure below.
[Figure: final running result of myos]
Now myos.com has the basic framework of an operating system, and from here on you only need to keep adding functions.
5. Use batch files
Although myos is still very simple and consists of only a few files, debugging it already takes many commands and is somewhat tedious. As the functions of myos keep improving, there will be more files and the debugging process will become more complicated.
Fortunately, cmd supports batch processing, and batch files greatly reduce the number of commands that have to be typed, which simplifies the debugging process.
Batch processing means that a batch of commands is processed together. In the early DOS systems and the later Windows systems, you can put multiple commands in a text file with the extension bat and then enter the file name on the command line; all commands in the file are then executed automatically, line by line, in order.
UNIX and Linux-like systems of course have the same capability, and if you know Linux you will know that its shell scripts are even more powerful. In fact, UNIX is much older than DOS.
Open Notepad, enter the debugging commands given in this section in order, and save the file as a.bat (the file name is arbitrary, as long as the extension is bat). The contents of the file are as follows:

nasm mbr.asm -o mbr.bin
nasm -f elf start.asm -o start.o
gcc -c myos.c -o myos.o 
ld -s --entry=start -Ttext=0x0 start.o myos.o -o myos.exe 
objcopy -O binary myos.exe myos.com
writea mbr.bin A.img 1
writea myos.com A.img 2
pause

Then enter a or a.bat on the cmd command line and press Enter. All commands in the file are executed automatically.
