C language pitfalls and defects (1)


Foreword

  • A sharp knife is effortless in the hands of Cook Ding (Pao Ding), but in ordinary hands it may cut its owner.
  • C is like that sharp knife: it readily hurts those who have not mastered it. If it has hurt you, or if you feel you have not fully mastered it, follow this article to learn how to keep C from hurting the careless.

Introduction

What will you gain from this article? It covers the following:
Part I: Studies the problems that occur while a program is being divided into tokens.
Part II: Continues with the problems that arise when the compiler combines a program's tokens into declarations, expressions, and statements.
Part III: Examines C programs made up of several parts that are compiled separately and linked together.
Part IV: Deals with conceptual misconceptions: what happens when a program actually executes.
Part V: Studies the relationship between our programs and the common libraries they use.
Part VI: Notes that the program we write is not the program we run; the preprocessor runs first.
Part VII: Discusses portability: why a program that works on one implementation may not work on another.

Part I: Lexical Pitfalls

  • The first part of a compiler is usually called a lexical analyzer. A lexical analyzer examines the sequence of characters that make up a program and divides it into tokens, a token being a sequence of one or more characters that has a (relatively) uniform meaning in the language being compiled.
  • C programs are divided into tokens twice. First the preprocessor reads the program; it has to tokenize the program to find the identifiers that name macros, and it then replaces each macro call by expanding the macro. The macro-replaced program is then reassembled into a character stream and sent to the compiler, which divides this stream into tokens a second time.
  • In this section, we explore common misconceptions about the meaning of tokens and about the relationship between tokens and the characters that compose them. We will talk about the preprocessor later.

1.1 = is not ==

  • C uses = for assignment and == for comparison. Assignment is more frequent than comparison, so it gets the shorter symbol.
  • C treats assignment as an operator, which makes multiple assignments (such as a = b = c) easy to write and lets assignments be embedded in larger expressions. This convenience carries a risk: it is possible to write an assignment where a comparison was intended. Examples:
    (1) The following statement looks as if it checks whether x equals y, but it actually assigns y to x and checks whether the result is nonzero:
    if (x = y) {
    	foo();
    }
    
    When you do mean to assign a value to a variable and then check whether the variable is nonzero, consider making the comparison explicit:
    if (0 != (x = y)) {
    	foo();
    }
    
    (2) Here is a loop intended to skip spaces, tabs, and newlines:
    while (c == ' ' || c = '\t' || c == '\n') {
    	c = getc(f);
    }
    
    The programmer accidentally used = instead of == in the comparison with '\t'. This "comparison" actually assigns '\t' to c and then checks whether the (new) value of c is nonzero. Since '\t' is nonzero, the "comparison" is always true, so the loop eats the entire file. If the implementation lets a program keep reading after it reaches end of file, the loop runs forever. (Because = binds more loosely than ||, a conforming compiler rejects this exact line, since the left side of = is not an lvalue; the bug slips through silently once the assignment is parenthesized, as in (c = '\t').) A good habit is to put the constant on the left, so that an accidental = fails to compile:
    while (' ' == c || '\t' == c || '\n' == c) {
    	c = getc(f);
    }
    

1.2 Multi-character tokens

  • Some C tokens, such as /, *, and =, are a single character. Others, such as /* and ==, and identifiers, are several characters long. Let's look at the trouble multi-character tokens can cause and how to avoid it.
  • The following statement looks like it sets y to the value of x divided by the value pointed to by p:
    y = x/*p
    
    In fact, /* begins a comment, so the compiler simply swallows the program text until a */ appears. In other words, the statement just sets y to the value of x and never sees p at all. It should be rewritten as:
    y = x / *p;
    
    or
    
    y = x / (*p);
    
  • This kind of near-ambiguity can cause trouble in other contexts. For example, older versions of C used =+ to mean what current versions write as +=. Such a compiler would treat
    a=-1;
    
    as a =- 1; that is, as a = a - 1;. This would surprise a programmer who intended to write:
    a = -1;
    
  • Such older C compilers would also parse
    a=/*b;
    
    as a =/ *b; even though /* looks like the start of a comment. So when writing code, make a habit of leaving a space on both sides of the equals sign.

1.3 Some exceptions for multi-character tokens

  • In older (pre-ANSI) compilers, compound assignment operators such as += are actually two tokens. Therefore,
    a + /* strange */ = 1
    
    means the same as a += 1. This is the only case in which something that looks like a single token is really several tokens. (ISO C treats += as a single token, so a conforming compiler rejects the first form.)
  • In particular, it is illegal to write:
    p - > a
    
    It is not a synonym for:
    p -> a  /* p -> a is the same as p->a */
    
  • Also, some older compilers still treat =+ as a single token and as a synonym for +=.

1.4 Strings and characters

  • Single quotes and double quotes have completely different meanings in C, and in some confusing contexts they can lead to strange results rather than error messages.
  • A character in single quotes is just another way of writing an integer: the value that corresponds to that character in the implementation's collating sequence. Thus, in an ASCII implementation, 'a' means exactly the same thing as 0141 (octal) or 97 (decimal).
  • A string in double quotes is shorthand for a pointer to the first character of an unnamed array that is initialized with the characters between the quotes plus one extra character whose binary value is zero.
  • The following two program fragments are equivalent:
    printf("Hello world\n");
    
    char hello[] = {
    	'H', 'e', 'l', 'l', 'o', ' ',
    	'w', 'o', 'r', 'l', 'd', '\n', 0
    };
    
    printf("%s", hello);
    
  • Using a pointer where an integer is expected (or vice versa) usually draws a warning, so using double quotes instead of single quotes (or vice versa) is usually caught. The exception is function calls on compilers that do not check argument types. There, writing
    printf('\n');
    
    instead of
    printf("\n");
    
    usually produces strange results at run time rather than a diagnostic.
    An integer is usually large enough to hold several characters, so some C compilers allow multiple characters in a character constant. This means that writing 'yes' instead of "yes" may go undetected. The latter means "the address of the first of four consecutive memory locations containing y, e, s, and a null character." The former means "an integer composed, in some implementation-defined way, of the values of the characters y, e, and s." Any similarity between the two is purely coincidental.

Part II: Syntactic Deficiencies

  • To understand a C program, it is not enough to know the tokens that make it up. You must also understand how those tokens combine into declarations, expressions, statements, and programs. Although these combinations are usually well defined, the definitions are sometimes counterintuitive or confusing.
  • In this section we will look at some less obvious syntactic constructs.

2.1 Understanding declarations

  • How should we read the following statement, which calls the subroutine located at address 0 (as the hardware of some machines does at startup)?

    (*(void(*)())0)(); 
    
  • Expressions like this strike terror into the hearts of C programmers. They need not, though: all we have to do first is understand how declarations work.

  • Every C variable declaration has two parts: a type, and a list of expressions in a stylized form that are expected to evaluate to that type.

  • The simplest expression is a variable:

    int f, g;
    

    Declares that the expressions f and g, when evaluated, have type int. Since what is being evaluated is an expression, parentheses can be used freely:

    int ((f));
    
  • The same logic applies to functions and pointer types . For example:

    int func(); 
    

    Indicates that func is a function that returns an int. Similarly:

    int *pf; 
    

    Indicates that *pf is an int, so pf is a pointer to int. These forms can be combined in declarations just as they can in expressions. Therefore,

    int *g(), (*h)();
    

    Indicates that *g() and (*h)() are expressions of type int. Since () binds more tightly than *, *g() means the same thing as *(g()): g is a function returning a pointer to int, and h is a pointer to a function returning an int.
    Once we know how to declare a variable of a given type, it is easy to write a cast for that type: just remove the variable name and the semicolon from the declaration and enclose the whole thing in parentheses. For example:

    int *g();
    

    declares g to be a function returning a pointer to int, so (int *()) is the cast notation for that type.

  • Armed with the above knowledge, we are now ready to tackle (*(void(*)())0)(). We can divide it into two parts for analysis.

  • First, suppose we have a pointer variable fp and want to call the function that fp points to. It can be written like this:

    (*fp)();
    

    If fp is a pointer to a function, then *fp is the function itself, so (*fp)() is the way to call it. Since the call yields no value (void), fp is declared like this:

    void (*fp)();
    
  • Now for the second step of the analysis. We need an appropriate expression to substitute for fp. If C could read our minds about types, we could write:

    (*0)()
    

    But this doesn't work, because the * operator requires a pointer as its operand. Moreover, the operand must be a pointer to a function so that the result of * can be called. Therefore, we need to cast 0 to a type that describes "pointer to function returning void".

    From the declaration void (*fp)(), we obtain that type by removing the name (and the semicolon) from the declaration:

    void (*)()
    
  • Enclosing that type in parentheses gives the cast, so we can convert 0 to a "pointer to function returning void" like this:

    (void(*)())0
    

    Finally, we replace fp with (void(*)())0 :

    (*(void(*)())0)();
    

2.2 Operators don't always have the precedence you might think

  • Suppose the defined constant FLAG is an integer with exactly one bit set in its binary representation (in other words, a power of 2), and you want to test whether the integer variable flags has that bit set. The usual way to write this is:

    if(flags & FLAG) ...
    

    The meaning is clear to most C programmers: the if statement tests whether the expression in parentheses is nonzero. To make the test more explicit, one might write:

    if(flags & FLAG != 0) ...
    

    Now the statement is easier to read. But it is also wrong, because != binds more tightly than &, so it is parsed as:

    if(flags & (FLAG != 0)) ...
    

    This works (by coincidence) if FLAG is 1 or 0, because the result of != is always 1 or 0, but not for any other power of 2.

  • Suppose you have two integer variables, h and l, whose values are between 0 and 15 inclusive, and you want to set r to an 8-bit value whose low-order four bits are those of l and whose high-order four bits are those of h. A natural way to write this is:

    r = h << 4 + l;
    

    Unfortunately, this is wrong. Addition binds more tightly than shifting, so this example is equivalent to:

    r = h << (4 + l);
    

    There are two correct ways:

    r = (h << 4) + l;
    
    r = h << 4 | l;
    

    One way to avoid this problem is to put everything in parentheses, but expressions with too many parentheses can be hard to understand, so it's best to remember the precedence in C.

  • However, C has 15 levels of operator precedence, which is a lot to memorize. Grouping the operators makes it easier:

    [Figures 1-5: operator precedence groups]
    Notes:
    Operators of the same precedence group according to their associativity.
    Rough precedence order: arithmetic operators > relational operators > && > || > assignment operators

2.3 Pay attention to the use of semicolons

  • An extra semicolon in a C program usually makes little difference: either it is a null statement, which has no effect, or it draws a diagnostic message from the compiler, which makes it easy to remove. An important exception is after an if or while clause, which must be followed by exactly one statement. Consider the following example:

    if (x[i] > big);
    	big = x[i];
    

    This will not cause compilation errors, but the meaning of this program is quite different from the following:

    if (x[i] > big)
    	big = x[i];
    

    The first block is equivalent to:

    if (x[i] > big) { }
    big = x[i];
    

    That is, directly equivalent to:

    big = x[i];
    
  • Another place where a semicolon makes a big difference is at the end of a structure declaration that immediately precedes a function definition. Consider the following program fragment:

    struct foo {
    	int x;
    }
    
    func() {
    	.....
    }
    

    A semicolon is missing after the first }, the one immediately before func. The effect is to declare that func returns a struct foo, with the structure defined as part of the function declaration. If the semicolon were present, func would be defined as returning int by default (in compilers that still accept an implicit int return type).

2.4 switch statement

  • C is unusual in that the cases of a switch statement can flow into one another. For example, consider the following fragment of a C program:
    switch(color) {
    case 1: printf ("red");
    		break;
    case 2: printf ("yellow");
    		break;
    case 3: printf ("blue");
    		break;
    }
    
    This program fragment prints red, yellow, or blue, depending on whether the variable color is 1, 2, or 3. Case labels in C are true labels: control can flow straight through a case label. Consider a different form of the same fragment:
    switch(color) {
    case 1: printf ("red");
    case 2: printf ("yellow");
    case 3: printf ("blue");
    }
    
    Assume color is 2. The program then prints yellowblue, because control passes straight on to the next call to printf().
  • This is both a strength and a weakness of the C switch statement. It is a weakness because it is easy to leave out a break statement, making the program behave in obscure and abnormal ways. It is a strength because deliberately leaving out a break makes it easy to express control structures that are otherwise hard to implement. In a large switch statement in particular, one often finds that the processing for one case reduces to a little special handling followed by the processing for another case.
  • For example, consider a compiler that skips whitespace while looking for a token. Here we want to treat spaces, tabs, and newlines identically, except that a newline should also increment the line counter:
    switch(c) {
    case '\n':
    	linecount++;
    	/* no break */
    case '\t':
    case ' ':
    	.....
    }
    

2.5 Function calls

  • Unlike some other programming languages, C requires a function call to have an argument list even if there are no arguments. So, if func is a function,
    func();
    
    is a statement that calls the function, whereas
    func;
    
    does nothing: it evaluates the address of the function but does not call it.

2.6 The dangling else problem

  • No discussion of syntactic pitfalls would be complete without this problem. Although it is not unique to C, it still bites C programmers with years of experience.

  • Please see the following program fragment:

    if (0 == x)
    	if (0 == y) error();
    else {
    	z = x + y;
    	func(&z);
    }
    

    The programmer who wrote this clearly intended to distinguish two cases: x == 0 and x != 0. In the first case, the fragment should do nothing except call error() when y == 0. In the second case, it should set z = x + y and call func() with the address of z as its argument.
    However, the fragment actually does something quite different, because an else is always associated with the nearest unmatched if. Written to show what it really does, the fragment looks like this:

    if (0 == x) {
    	if (0 == y) {
    		error();
    	}
    	else {
    		z = x + y;
    		func(&z);
    	}
    }
    

    In other words, nothing at all happens when x != 0.

    To get the effect suggested by the indentation of the first example, write:

    if (0 == x) {
    	if (0 == y) {
    		error();
    	}
    }
    else {
    	z = x + y;
    	func(&z);
    }
    

Part III: Linkage

3.1 You have to check external types yourself

  • Suppose you have a C program divided into two files, A and B. File A contains the external declaration:
    int n;
    
    and file B contains the external declaration:
    long n;
    
  • This is not a valid C program, because the same external name is declared with two different types in the two files. However, many implementations do not detect the error, because the compiler does not know the contents of one file while it is compiling the other. The type check can therefore only be done by the linker (or by a utility program such as lint); if the operating system's linker knows nothing about data types, there is little the C compiler can do to enforce them.
  • So what actually happens when this program runs? There are several possibilities:

    1. The implementation is smart enough to detect the type conflict. You then get a diagnostic message saying that n has different types in the two files.
    2. The implementation treats int and long as the same type. This is typical of machines on which 32-bit arithmetic comes naturally. In that case the program will probably work as if n had been declared long (or int) in both files, but it works purely by coincidence.
    3. The two instances of n require different amounts of storage but happen to share storage in such a way that a value assigned to one is also valid for the other. This can happen, for example, if the int ends up sharing storage with the low-order part of the long. Whether it does depends on both the system and the machine, and a program that works this way also works only by coincidence.
    4. The two instances of n share storage in such a way that assigning a value to one has the effect of assigning a different value to the other. In this case, the program will probably fail.

  • Even so, situations like this come up surprisingly often in practice. For example, suppose file A of a program contains the declaration:
    char filename[] = "etc/passwd";
    
    and file B contains this declaration:
    char *filename;
    
    (1) Although arrays and pointers behave very similarly in some contexts, they are not the same. In the first declaration, filename is the name of a character array. The name can be used to generate a pointer to the first element of the array, but that pointer is generated only when needed and is not stored anywhere. In the second declaration, filename is the name of a pointer, which can point wherever the programmer makes it point. If the programmer does not assign it a value, it has a default value of 0 (null). [Note: this default applies to external and static pointers such as this one; an uninitialized local (automatic) pointer has an indeterminate value, which is dangerous.]
    (2) The two declarations use storage in different ways and cannot coexist. One way to avoid this kind of conflict is to use a tool like lint, if it is available. Checking for type conflicts between separately compiled parts of a program requires seeing all of the parts at once. A typical compiler cannot do that, but lint can.
    (3) Another way to avoid the problem is to put external declarations into include files. Then the type of each external object appears only once. [Some C compilers insist on exactly one definition of each external object, though it may be declared many times. With such a compiler, it is easy to put the declaration in an include file and the definition elsewhere. The type of each external object then appears twice, but that is still better than more than twice.]

To Be Continued

  • Because "C Language Pitfalls and Defects" covers a lot of material, this article is split into two parts.
  • Parts IV to VII will be covered in the next installment, so stay tuned.

Origin blog.csdn.net/m0_37383484/article/details/127090347