C/C++ strings: declaration and definition, length, sizeof, and a detailed explanation of '\0'

Introduction

Recently I have been working on a small project that uses an STM32F103C8T6 microcontroller to turn computer audio into lighting effects on an LED strip. While debugging with serial-port printing, I ran into problems concerning the exact length of a string and its terminating '\0'. After consulting a lot of material and testing as thoroughly as I could, I am summarizing the results here.

Four ways to declare and define a string

Strings have always been frequent guests in C/C++ programs. There are four common ways to declare and define one (to keep things simple, this article does not distinguish carefully between "declaration" and "definition"; "definition" is used below for all of them. If necessary, see: The difference between declaration and definition in C):

char *rick = "Wubba";
char morty[10] = "Whoa!";
char squanchy[] = "Squanch it.";
char snowball[] = {'t', 'e', 's', 't', 'i', 's'};

Apart from the second form, which must specify the space reserved for the string, the forms are quite free. Of course, freedom also means hidden details, and for a relatively low-level language such as C, that is why mistakes come easily.

How to determine the length of a string

Defining a string in C is quick and simple, but it involves many details that are easy to confuse and easy to get wrong. This section tries to explain clearly how to determine the length of a string.

Case 1: rick string

char *rick = "Wubba";

With the first definition method, the string ends with the character '\0' (ASCII code 0). We did not write a '\0' when defining rick, but one is automatically appended after the last character of the string "Wubba" that we wrote explicitly.

If we use %s to output the rick string, we will get exactly the five letters "Wubba" we expect.

#include <stdio.h>

int main() {
  char *rick = "Wubba";
  printf("%s", rick);

  getchar();
  return 0;
}

rick string %s output result


We define the rick string quite casually, yet the compiler still interprets it correctly. The details hidden behind this convenience are the culprit behind countless bugs.

As mentioned earlier, the character '\0' is automatically appended to the end of the rick string we defined. A for loop makes this clearly visible:

#include <stdio.h>

int main() {
  char *rick = "Wubba";

  // 6 iterations: the 5 letters plus the terminating '\0'
  for (int i = 0; i < 6; ++i)
    printf("%c", rick[i]);

  getchar();
  return 0;
}

for loop output

You can see a gap between the output word and the cursor: this is the '\0' character we forced out (I really did not type it). How '\0' is displayed depends on the terminal; some show a blank, others show nothing at all.


We can also test further:

printf("%d", rick[5]); // print '\0' as an int

'\0' printed as an int
The circled 0 is the numeric value (ASCII code) of '\0'.


If we draw the rick string (or "character array") as it sits in memory:
What the Wubba string looks like in memory
The values at index 6 and beyond are unknown. They are determined by the current state of the computer's memory, and reading them is undefined behavior.


In fact, we can also put '\0' into a string by hand (and not only at the end). When the result is treated as a string and printed with %s, the output is cut off at the '\0'.

  char *rick = "Wub\0ba";
  printf("%s", rick);

Output of "Wub\0ba"


If you force the output with a for loop, you can see that '\0' is displayed as a blank and that the characters after it were stored normally.

  char *rick = "Wub\0ba";
  for (int i = 0; i < 6; ++i)
    printf("%c", rick[i]);

"Wub\0ba" for loop output

Case 2: morty string

Case 1 has covered most of the basic logic and details of strings in C, so the remaining cases are easier to explain.

The protagonist of this part is the morty character array. Its biggest difference from the other definitions is that it explicitly declares the size of the array, that is, the memory reserved for the string.

  char morty[10] = "Whoa!";

In this example, the declared size is clearly larger than the actual string length. How does C handle this situation?

A simple test makes everything clear.

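#include <stdio.h>

int main() {
  char morty[10] = "Whoa!";

  // Contents of the morty array (as characters).
  // Note: indices 10-14 lie outside the array. Reading them is undefined
  // behavior; it is done here only to peek at the neighboring memory.
  printf("morty[]:");
  for (int i = 0; i < 15; ++i)
    printf("%4c", morty[i]);
  printf("\n");

  // Contents of the morty array (as ASCII codes)
  printf("ASCII:  ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", morty[i]);
  printf("\n");

  // Corresponding indices
  printf("index:  ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", i);

  getchar();
  return 0;
}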

morty array test results


Here we use printf's field-width formatting (such as %4d) to print each character in the morty array, its ASCII code, and its index in vertically aligned columns. The effect is the same as drawing a diagram (though a good deal uglier than the previous picture).

Observation shows that the "Whoa!" string we wrote explicitly was initialized as expected. Outside the 10 chars reserved for the morty array, at index 14, there is an indeterminate value (according to the ASCII table it is a '.' character). Inside the range we reserved for morty (indices 0-9), however, the elements we did not write (the five elements at indices 5-9) are all 0.

At this point we can make a bold guess: reserved elements that the initializer does not cover are set to 0, that is, '\0', by default. Let us put the guess to the test.

Practice is the sole criterion for testing truth (the testing stage)

To explore whether the guess holds in general, one intuitive approach is to change the state of memory and look for counterexamples (reserved elements not covered by the initializer that hold a value other than 0).
Based on my limited understanding of computer science, I came up with the following test:

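#include <stdio.h>

int main() {
  // Interference array: exactly 10 characters, so no '\0' is stored
  // (valid in C, but an error in C++)
  char squirrel[10] = "follow him";
  char morty[10] = "Whoa!";

  // Contents of the morty array (as characters);
  // indices 10-14 are deliberately read out of bounds (undefined behavior)
  printf("morty[]:");
  for (int i = 0; i < 15; ++i)
    printf("%4c", morty[i]);
  printf("\n");

  // Contents of the morty array (as ASCII codes)
  printf("ASCII:  ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", morty[i]);
  printf("\n");

  // Corresponding indices
  printf("index:  ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", i);

  getchar();
  return 0;
}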

The results are as follows:


morty array practice test output


Ooh la la! It seems that years of immersion in the computer world have not only expanded my knowledge but also cultivated some intuition. I made several more attempts and the results were similar: all in line with the conjecture.


(Here the size of the morty array is changed to 12, its content is changed slightly, and one more array is defined after morty. The result is as follows.)

#include <stdio.h>

int main() {
  char squirrel[10] = "follow him"; // interference array
  char morty[12] = "Who123a!";
  char squirrel1[10] = "follow him"; // interference array

  // Contents of the morty array (as characters);
  // indices 12-14 are deliberately read out of bounds (undefined behavior)
  printf("morty[]:");
  for (int i = 0; i < 15; ++i)
    printf("%4c", morty[i]);
  printf("\n");

  // Contents of the morty array (as ASCII codes)
  printf("ASCII:  ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", morty[i]);
  printf("\n");

  // Corresponding indices
  printf("index:  ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", i);

  getchar();
  return 0;
}

More attempts in practice


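Experiments like these can only bring us close to the truth; they cannot prove it. The authoritative answer is in the C standard: when an array has fewer initializers than elements, or is initialized from a string literal shorter than the array, the remaining elements are initialized the same way as objects with static storage duration, that is, to zero (C11, section 6.7.9, paragraph 21). So the conjecture holds, and the zeros do not depend on the state of memory. Corrections from readers are still very welcome.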

Case 3: squanchy string

char squanchy[] = "Squanch it.";

(Case 3 gets its own section purely to keep the format tidy.)

As far as the terminating '\0' is concerned, squanchy behaves exactly like Case 1. In most expressions the array name squanchy is converted to a pointer to the first element of the array, so it can be indexed and printed just like rick. (The two are not the same kind of object, though: squanchy is an array, rick is a pointer, and the difference shows up with sizeof() later in this article.)


Running the same test as in Case 2 gives the following results:

#include <stdio.h>

int main() {
  char squirrel[10] = "follow him"; // interference array
  char squanchy[] = "Squanch it.";

  // Contents of the squanchy array (as characters);
  // indices 12-14 are deliberately read out of bounds (undefined behavior)
  printf("squanchy[]:");
  for (int i = 0; i < 15; ++i)
    printf("%4c", squanchy[i]);
  printf("\n");

  // Contents of the squanchy array (as ASCII codes)
  printf("ASCII:     ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", squanchy[i]);
  printf("\n");

  // Corresponding indices
  printf("index:     ");
  for (int i = 0; i < 15; ++i)
    printf("%4d", i);

  getchar();
  return 0;
}

squanchy test results

As you can see, when no reserved size is specified, the compiler simply appends one '\0' to the end of the string (at index 11) and reserves nothing more; indices 12 to 14 printed above are outside the array. This is usually all we need.

Case 4: snowball string

  char snowball[] = {'t', 'e', 's', 't', 'i', 's'};
  printf("%s", snowball);

This way of defining a string hides a serious danger. The compiler gives this form no help: if you do not explicitly add '\0' at the end, nobody adds it for you, and printing the array with %s reads past its end, which is undefined behavior. A possible result:

"testis" is printed as expected, but what are the extra characters after it?


Someone else got even more spectacular results: Portal.

Using sizeof() on strings

Normally, you can easily get the size of an array with sizeof() (strictly speaking an operator, not a function). It is especially convenient for strings stored in char arrays: sizeof(char) is 1 by definition, so sizeof() applied to a char array directly gives the number of elements, including the terminating '\0'. For arrays of other types it takes one more step:

char rick[] = "Wubba\0";
int egg[] = {123, 456, 666};
// sizeof yields a size_t, so it is printed with %zu rather than %d
printf("        Type size    Array size     Array length\n");
printf("rick:   %9zu %11zu %16zu\n", sizeof(char), sizeof(rick), sizeof(rick)/sizeof(char));
printf("egg:    %9zu %11zu %16zu",   sizeof(int),  sizeof(egg),  sizeof(egg)/sizeof(int));

sizeof test result

A brief analysis

  1. For the int array egg: sizeof() gives 12, which is 4 times the actual number of elements, because int occupies 32 bits (4 bytes) here. To get the element count, you need sizeof(egg)/sizeof(int). For the string rick, both methods give the same result, because sizeof(char) is 1.
  2. For the char array rick: although we manually added '\0' at the end of the string, the compiler still appended another '\0' (you can verify this by printing each element with %d in a for loop), so sizeof(rick) is 7. This is easy to understand: the compiler does not check whether the user has already written a '\0'; it treats every string literal the same way.

Character array or pointer?

During testing, I made an unexpected discovery.

If sizeof() is used for all four string definition methods:

char *rick = "Wubba";
char rick1[] = "Wubba";
char rick2[] = {'W', 'u', 'b', 'b', 'a', '\0'};
char rick3[10] = "Wubba";

// sizeof yields a size_t, so it is printed with %zu rather than %d
printf("%zu ", sizeof(rick));
printf("%zu ", sizeof(rick1));
printf("%zu ", sizeof(rick2));
printf("%zu ", sizeof(rick3));

The results are surprising. The results for rick1, rick2 and rick3 are reasonable (note that the '\0' added by the compiler is counted too), but the result for the first one, rick, is a rather absurd 4!

Confused, I tried changing the length of the string, changing the length of the variable name, adding interference arrays... but whatever I changed, the result would not budge.

After looking it up, I found that in the eyes of sizeof() this definition is something similar but different: a pointer. The 4 is simply the size of a pointer on a build where pointers are 4 bytes (a 32-bit build); on a 64-bit build it would typically be 8.

char *rick = "Wubba"; // what we think of as a character array
printf("rick: %zu %p\n", sizeof(rick), (void *)rick);

char suffix = '?';
char *p = &suffix;   // a plain pointer
printf("p:    %zu %p", sizeof(p), (void *)p);


Therefore, when a string is defined through a pointer, as rick is, sizeof() cannot give you the length of the string; it only reports the size of the pointer itself.


Summary

Strings are so common that we feel we know them, yet they still hold surprises. Exploring them brings us closer to the truth and makes the programs we write better.

The following summarizes the key points of the article:

  1. Except for the following form, the compiler automatically appends '\0' to the end of the string.
	char snowball[] = {'t', 'e', 's', 't', 'i', 's', '\0'}; // never forget to add the '\0' yourself!
  2. When printing with printf's %s, output stops just before the first '\0'.
  3. The values outside the defined range of a string are unpredictable and depend on the current state of the computer's memory (reading them is undefined behavior). "Defined" covers two situations here:
    • In morty[10], all 10 reserved elements are defined (the reserved elements not covered by the initializer are set to '\0').
    • In the other three forms, it means the characters written in the declaration plus the '\0' appended by the compiler (a character list like snowball[] gets no automatic '\0').
  4. For a char array, sizeof() directly gives the number of elements, including the terminating '\0', because sizeof(char) is 1; for arrays of other types, divide by the element size.
  5. Applying sizeof() to a string defined through a pointer (such as *rick) gives not the length of the string but the size of the pointer itself.

(End)


Origin blog.csdn.net/weixin_39591031/article/details/109726381