Member arrays and pointers in C structures

The code is listed below:
 
 
  1. #include <stdio.h>
  2. struct str{
  3.     int  len;
  4.     char s[0];
  5. };
  6.
  7. struct foo {
  8.     struct str *a;
  9. };
  10.
  11. int main(int argc, char** argv) {
  12.     struct foo f={0};
  13.     if (f.a->s) {
  14.         printf( f.a->s);
  15.     }
  16.     return 0;
  17. }

If you compile and run the code above, the program crashes at the printf on line 14 under both VC++ and GCC. @Laruence called this a classic pitfall. I don't see how it qualifies. Looking at the code, you have to ask: why does the if statement check the array inside f.a instead of f.a itself? What was the author of this code thinking? Was it written as a joke? In any case, I personally think the real cause is a poor understanding of the C language. If this counts as a pitfall, then everything is a pitfall.

Next, you can debug it, or you can change the printf statement on line 14 to:

 
 
  printf("%x\n", f.a->s);

You will see that the program no longer crashes, and it prints 4. So the original code was reading from memory address 0x4; it would be strange if that did not crash. At this point you probably have the following questions:

1) Why doesn't the if statement on line 13 fail? f.a was initialized to null, so why doesn't accessing a member through a null pointer crash?

2) Why is address 0x4 being accessed? Where on earth does the 4 come from?

3) What is char s[0] on line 4 of the code? A zero-length array? Why would anyone write that?

Let us start from the basics and work through these odd corners of C one at a time.

Members in the structure

First of all, we need to know that a variable is really just an abstract name for a memory address. In a statically compiled program, every variable name is converted to a memory address at compile time. The machine knows nothing about the names we chose, only addresses.

That is why we have a stack area, a heap area, a static area and a constant area: the compiler arranges for every variable in our code to live in one of these memory areas.

With that foundation, let's look at the addresses of structure members. First, simplify the code:

 
 
  struct test{
      int i;
      char *p;
  };

In the code above, the members i and p of struct test are represented by the C compiler as relative addresses; that is, their addresses are offsets relative to an instance of struct test. Suppose we have this code:

 
 
  struct test t;

Inspecting the instance t in gdb, we see:

 
 
  # The p in the t instance is a wild pointer
  (gdb) p t
  $1 = {i = 0, c = 0 '\000', d = 0 '\000', p = 0x4003e0 "1\355I\211\..."}

  # Output the address of t
  (gdb) p &t
  $2 = (struct test *) 0x7fffffffe5f0

  # Output the address of t.i
  (gdb) p &(t.i)
  $3 = (int *) 0x7fffffffe5f0

  # Output the address of t.p
  (gdb) p &(t.p)
  $4 = (char **) 0x7fffffffe5f4

We can see that the address of t.i is the same as the address of t, and the address of t.p is 4 greater than the address of t. To put it bluntly, t.i is really (&t + 0x0) and t.p is really (&t + 0x4). The offsets 0x0 and 0x4 are the addresses of members i and p, hard-coded by the compiler at compile time. So no matter which instance of the structure you have, accessing one of its members really just means adding the member's offset to the address of the instance.

Let's do an experiment:

 
 
  #include <stddef.h>   /* for NULL */

  struct test{
      int i;
      short c;
      char *p;
  };

  int main(void) {
      struct test *pt=NULL;
      return 0;
  }

After compiling, we debug it with gdb. Once pt has been initialized, we see the following. Even though pt is NULL, accessing a member still just means accessing an address relative to pt:

 
 
  (gdb) p pt
  $1 = (struct test *) 0x0
  (gdb) p pt->i
  Cannot access memory at address 0x0
  (gdb) p pt->c
  Cannot access memory at address 0x4
  (gdb) p pt->p
  Cannot access memory at address 0x8

Note: the offset of pt->p above is 0x8 rather than 0x6 because of memory alignment (I am on a 64-bit system). For memory alignment, see the article "In-depth Understanding of C Language".

Okay, now you know why address 0x4 was accessed in the original question: it is a relative address. f.a is NULL and s sits at offset 4, right after the int len, so f.a->s evaluates to 0 + 4.

Relative addresses make some interesting programming tricks possible, such as giving C an object-oriented feel. You can refer to my article "Writing Object-Oriented Programs in C" from exactly 11 years ago (a dangerous game of forced casts between pointer types; by comparison, the C++ compiler manages inheritance and virtual function tables for you, and the semantics are much clearer).

The difference between pointers and arrays

With that foundation, change char s[0]; in struct str in the original code to char *s; and try again. You will find that the program now dies at the if condition on line 13 with "Cannot access memory". Why does the program die on line 14 when the member is declared as char s[0], but on line 13 when it is declared as char *s? What is the difference between char *s and char s[0]?

Before explaining this, we need to look at the assembly code. Viewing it in GDB, I found that:

  • For char s[0], the assembly code uses the lea instruction: lea 0x04(%rax), %rdx
  • For char *s, the assembly code uses the mov instruction: mov 0x04(%rax), %rdx

lea stands for load effective address: it loads the address itself, whereas mov loads the contents stored at that address. With %rax holding NULL, lea merely computes 0x4, while mov tries to read memory at 0x4, and that is why it crashes.

From this we can see that accessing a member array by name actually yields the relative address of the array, whereas accessing a member pointer yields the contents stored at the relative address (just like accessing any other member that is not an array).

In other words, for an array char s[10], the array name s and &s yield the same address (the types differ, but the value is the same; if you don't believe me, write a program and try it yourself). In our example, both therefore stand for the offset address. So as long as we only take the address of a member, whether it is a pointer or any other member variable, nothing is read from memory and the program does not crash.

The following code, for example, runs without crashing at all (if you disassemble it, you will see that only lea instructions are used):

 
 
  #include <stdio.h>

  struct test{
      int i;
      short c;
      char *p;
      char s[10];
  };

  int main(void) {
      struct test *pt=NULL;
      printf("&s = %p\n", (void *)pt->s);   // Same address as printf("%p\n", (void *)&(pt->s));
      printf("&i = %p\n", (void *)&pt->i);  // -> binds tighter than &, so &(pt->i) is not needed
      printf("&c = %p\n", (void *)&pt->c);
      printf("&p = %p\n", (void *)&pt->p);
      return 0;
  }

Having seen this, do you still think it counts as a pitfall? Don't blame the language for everything; consider whether the problem lies with you.


About zero-length arrays

First of all, zero-length arrays are not allowed by the ISO C and C++ standards. That is why VC++ 2012 emits a warning when compiling this: "warning C4200: Non-standard extension used: zero size array in structure/union".

So why does gcc accept it without even a warning? Because GCC supported this technique as an extension ahead of C99, so the zero-length array is legal there. The relevant GCC documentation is "Arrays of Length Zero", which gives an example (I modified it slightly so that it runs):

 
 
  #include <stdlib.h>
  #include <string.h>

  struct line {
     int length;
     char contents[0];  // The C99 form is: char contents[]; with no array length given
  };

  int main(void) {
      int this_length=10;
      struct line *thisline = (struct line *)
                       malloc (sizeof (struct line) + this_length);
      thisline->length = this_length;
      memset(thisline->contents, 'a', this_length);
      return 0;
  }

The code above means: I want to allocate a variable-length array, so I define a structure with two members. One is length, the length of the array; the other is contents, the contents of the array. The this_length in the code (10 here) is the length of the data I want to allocate. (Doesn't this look a bit like a C++ class?) This technique is known as a flexible array.

Let's take a look with gdb:

 
 
  (gdb) p thisline
  $1 = (struct line *) 0x601010

  (gdb) p *thisline
  $2 = {length = 10, contents = 0x601010 "\n"}

  (gdb) p thisline->contents
  $3 = 0x601014 "aaaaaaaaaa"

We can see that when printing *thisline, the address of the member contents appears to be the same as thisline itself (an offset of 0x0??!!). But when we print thisline->contents, the address is offset by 0x4 and the content is the ten 'a' characters. (I think this is a GDB bug; the VC++ debugger displays it correctly.)

Going further: if you evaluate sizeof(char[0]) or sizeof(int[0]), you will find that sizeof returns 0. This means that a zero-length array exists inside the structure but contributes nothing to the size of the structure. You can think of it simply as a placeholder with no content. Only once we allocate memory for the structure does that placeholder become an array with an actual length.

At this point you may ask: why go to all this trouble? Can't you just declare contents as a pointer and allocate memory for it separately? Like this:

 
 
  #include <stdlib.h>
  #include <string.h>

  struct line {
     int length;
     char *contents;
  };

  int main(void) {
      int this_length=10;
      struct line *thisline = (struct line *)malloc (sizeof (struct line));
      thisline->contents = (char*) malloc( sizeof(char) * this_length );
      thisline->length = this_length;
      memset(thisline->contents, 'a', this_length);
      return 0;
  }

Isn't that clearer? Nothing about it is odd or hard to understand. Yes, this is also a common way to write it, and the code is clear and readable. So why bother with a zero-length array at all? Isn't that a bit much?!

The reason is that we want the data in a structure to live in one contiguous block of memory! This has two advantages:

The first advantage is easier memory release. Suppose our code lives in a function that other people call, and inside it you perform a second allocation and return the whole structure to the caller. The caller can release the structure by calling free, but does not know that a member of the structure also needs to be freed, and you cannot expect the caller to discover this. If instead we allocate the memory for the structure and for its members in one go and return a pointer to the structure, the caller can release everything with a single free. (Reading this, you will surely think that a destructor in an encapsulated C++ class makes this much easier and cleaner.)

The second advantage is access speed. Contiguous memory helps access speed and also reduces memory fragmentation. (Honestly, I don't think the gain is large; either way you cannot escape addressing through offset addition.)

Let's see how contiguous it really is, using gdb's x command. (We know that char contents[] in struct line {} takes up no space in the structure, so struct line has only a single int member of 4 bytes; we allocated 10 bytes for contents[], for a total of 14 bytes.)

 
 
  (gdb) x /14b thisline
  0x601010:       10      0       0       0       97      97      97      97
  0x601018:       97      97      97      97      97      97

From the memory layout above, we can see that the first 4 bytes are int length, and the last 10 bytes are char contents[].

If you use pointers, it will look like this:

 
 
  (gdb) x /16b thisline
  0x601010:       1       0       0       0       0       0       0       0
  0x601018:       32      16      96      0       0       0       0       0
  (gdb) x /10b thisline->contents
  0x601020:       97      97      97      97      97      97      97      97
  0x601028:       97      97

The output above shows four rows of memory, of which:

  • The first four bytes of the first row are int length, and the last four bytes of the first row are alignment padding.
  • The second row is char* contents. A pointer is 8 bytes long on a 64-bit system, and its bytes are 0x20 0x10 0x60 (stored least significant byte first), which is the address 0x601020.
  • The third and fourth rows are the data that char* contents points to.

From this we can see the difference: where an array member sits, you find the content itself, whereas where a pointer member sits, you find the address of the content.

Postscript

Well, that is the end of the article. But allow me a few more words.

1) After reading this article, do you still think C is simple? I don't. In places its complexity is no less than that of C++.

2) People who cannot learn C++ are surely people who could not learn C properly either. If you have not even learned C well, you are in no position to look down on C++.

3) When you call something a pitfall, ask yourself whether it really is a pitfall or whether the problem is your own ability to learn.

If you think your C is not bad, you are welcome to take a look at "The Puzzles of C Language" and "Who Says C is Simple?", as well as the articles "Language Ambiguity" and "In-depth Understanding of C Language".


Origin blog.csdn.net/junzhu_beautifulpig/article/details/51550307