[Study Notes] Those things about byte data and byte strings (b" ")

0 Preface

  Recently I tried to write a host-computer (PC-side) program with PyQt and ran into many problems with byte strings. Here are a few key points.

1 Let's first look at how to deal with this byte data in C language

  Anyone who has worked with embedded modules will know that an LCD module with a built-in font library actually stores its font indexed by GBK or GB2312 codes. The so-called "font library" is a mapping between a character's encoding and the pixel data used to display that character. To display a character, there is therefore no need to set its pixels by hand; only the character's code needs to be transmitted, which makes the code much easier to write.

  Therefore, to display a character on an LCD with a font library, you only need to encode it in the corresponding format (the GBK or GB2312 mentioned above). Yet when actually writing the code, there seems to be no explicit encoding step.

  This leads to the first question: how does C determine the encoding of characters? I am not sure about the more complex mechanisms, so here is the simplest and most easily overlooked point: the encoding of the C source file itself. If you don't believe it, take an LCD project with a font library and re-save the source file containing the display code as UTF-8. The display will most likely turn into garbled characters.

  Let me give another example to prove it. Create a new file in VS Code (the default is in utf8 format), and enter the following code:

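#include <stdio.h>
#include <string.h>

/* The bytes stored here depend on the encoding of this source file */
char s[] = "中文";

int main(void)
{
    printf("中文\n");
    /* strlen returns size_t, so cast it before printing with %d */
    printf("%d\n", (int)strlen(s));
    return 0;
}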

Then run the code. If the terminal is PowerShell (5.x or 7.x) on a Chinese-language Windows system, garbled characters will most likely be output, because the console's default encoding there is GBK while the program writes UTF-8 bytes.

  Then click UTF-8 in the status bar at the lower right corner, choose "Save with Encoding", select GB2312, then rebuild and run it again. This time there are no garbled characters in the terminal, because the encoding of the output matches the encoding of the terminal.

  To sum up, in C/C++ the encoding of the strings a program outputs (whether to the terminal or to a serial port) is, by default, tied directly to the encoding of the source file.
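The same effect can be reproduced in Python as a rough sketch. It assumes the C compiler copies the source file's bytes into the string literal unchanged, which is the default behaviour described above; the variable names are only illustrative.

```python
text = "中文"

# Bytes a UTF-8 source file would store for the literal
utf8_bytes = text.encode("utf-8")
# Bytes a GBK/GB2312 source file would store for the same literal
gbk_bytes = text.encode("gbk")

print(len(utf8_bytes))  # 6: three bytes per character
print(len(gbk_bytes))   # 4: two bytes per character

# A GBK terminal interpreting the UTF-8 bytes shows different characters
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled == text)  # False
```

This also explains why strlen in the C example reports a different length depending on how the file was saved.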

1.1 Summary of use

  How is it usually used? Here we again take serial port output, which is common in embedded development, as an example.

  First, we need to know what byte data is. During data transmission it is impossible to send human-readable characters directly; all content must be encoded into binary data first. It is usually written in hexadecimal, but the essence is the same. Eight binary bits make one byte.

  Therefore, there is a mapping between characters and byte data. Suppose, as an arbitrary example, that the character "I" is assigned the byte 0x01. To transmit the character "I", the sender passes the byte 0x01; this is encoding. The receiver then takes this byte and recovers "I" from 0x01 using the same mapping; this is decoding. The key point is that both sides must use the same mapping, which is the encoding method, such as UTF-8 or UTF-16.
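A minimal sketch of this sender/receiver agreement in Python follows. The real encodings UTF-8 and UTF-16 are used instead of the made-up 0x01 mapping, and the variable names are illustrative only.

```python
message = "I"

# Sender: encoding turns the character into bytes
payload = message.encode("utf-8")
print(payload.hex())  # 49

# Receiver with the same mapping recovers the character
received = payload.decode("utf-8")
print(received == message)  # True

# A different mapping produces different bytes for the same character
utf16_payload = message.encode("utf-16-le")
print(utf16_payload.hex())  # 4900
```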

  In C, data transmission is, to be honest, very loose. This may well be by design. For example, to transmit the character "A" I can pass the character 'A' directly, and it is automatically encoded into the corresponding byte data according to the file's encoding; I can also pass the numerical value of that byte, whether written in binary, decimal, or hexadecimal. Similarly, characters can be used directly as numbers in calculations, and the value used is the character's code, that is, its ASCII value.

  To sum up, data handling in C is very loose. One could say there is no separate concept of byte data, because everything can basically be treated as a value. If a value is within the encoding range, it can also be cast with (char) to the character to display.

  The advantage of this design is that there is basically no difference between the transmitted byte data and the string used for display. A char can be used directly in calculations, and int values can be output directly to the terminal as characters. The disadvantage may be that it is too flexible, and the encoding is constrained by the encoding of the source file.
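For contrast, here is a short Python sketch of the same operations. It is only an illustration of how Python makes the character/number conversion explicit where C does it implicitly.

```python
# Character -> number must be requested explicitly with ord()
code = ord("A")
print(code)  # 65

# Number -> character goes through chr()
next_char = chr(code + 1)
print(next_char)  # B

# Mixing a string and a number directly is an error in Python
try:
    "A" + 1
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# Numbers -> byte data goes through bytes()
packed = bytes([0x41, 0x42])
print(packed)  # b'AB'
```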

2 Let's take a look at how byte data is processed in Python

  Compared with the simplicity and directness of C, Python is more complicated here. The most important difference is that it adds the byte string (the bytes type, written b"...").

  First, it needs to be clear that a byte string is not itself the numeric byte data; it is a presentation of byte data. In C, both characters and bytes can be treated as values, but this does not work in Python.

That is, a byte string such as b"\x01" cannot be converted to a number directly with int(), because it is not itself a value.
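The original screenshot is not available, so the following is a small reconstruction under that assumption. One detail worth knowing: int() does accept bytes, but it parses them as ASCII digit text rather than reading the raw byte value.

```python
raw = b"\x01"

# int() does not read the raw byte value, so this fails
try:
    int(raw)
    converted = True
except ValueError:
    converted = False
print(converted)  # False

# int() only succeeds when the bytes spell out a number in ASCII
parsed = int(b"12")
print(parsed)  # 12
```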

  • Getting values from a byte string

  So what if I just want the values in a byte string? Index it with a subscript!

Indexing a byte string with a subscript yields data of type int, and the value is always unsigned (0 to 255). For example, the byte 0xff reads as 255.
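A short sketch of subscript indexing, with made-up sample bytes:

```python
data = b"\x01\xff\x41"

first = data[0]
print(type(first) is int)  # True
print(first)               # 1
print(data[1])             # 255, not -1

# Iterating over a byte string also yields integers
values = list(data)
print(values)  # [1, 255, 65]
```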

  The above reads one byte at a time. What if the transmitted data is a combination of two bytes and the target is the number those two bytes represent? For this, use the from_bytes method of the int class.

Note that indexing a single element of a byte string gives an integer (a[0]), while a slice covering several elements (a[0:3]), or even just one (a[0:1]), is still a byte string.

int.from_bytes converts a byte string to an integer and supports multiple bytes. You can specify big-endian or little-endian byte order, and whether the result is treated as a signed integer (indexing a single element, as above, always gives an unsigned value).
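The sketch below illustrates both points with made-up sample bytes. byteorder is passed explicitly because it only became optional in newer Python versions.

```python
a = b"\x12\x34\xff"

# Index -> int, slice -> bytes (even a one-element slice)
print(type(a[0]) is int)      # True
print(type(a[0:1]) is bytes)  # True

# Two bytes combined into one number, in both byte orders
big = int.from_bytes(a[0:2], byteorder="big")
little = int.from_bytes(a[0:2], byteorder="little")
print(hex(big))     # 0x1234
print(hex(little))  # 0x3412

# The same single byte read as unsigned and as signed
unsigned = int.from_bytes(a[2:3], byteorder="big", signed=False)
signed = int.from_bytes(a[2:3], byteorder="big", signed=True)
print(unsigned)  # 255
print(signed)    # -1
```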

  • Displaying a byte string

  Besides taking values out of a byte string for calculation, there is also the question of how to convert a byte string into a displayable string.

The most commonly used functions are encode and decode. Note that when bytes data is passed to print, what is displayed is not the underlying integers but the visualization of the byte data: the byte string.
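A minimal round trip, assuming GBK as the encoding (any codec agreed on by both sides would do):

```python
text = "中文"

# str -> bytes
encoded = text.encode("gbk")
print(encoded)  # b'\xd6\xd0\xce\xc4'

# bytes -> str, using the same encoding
decoded = encoded.decode("gbk")
print(decoded == text)  # True
```

print shows the b'...' form of the four bytes rather than the integers 214, 208, 206, 196.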

  Another point needs attention here: the second parameter, errors (for example "ignore"), which specifies what to do when an error occurs (generally, a byte sequence has no corresponding character in the chosen encoding). There are three main options:

  • strict, or omitted: as soon as a byte has no corresponding character, an error is raised
  • ignore: a byte with no corresponding character is skipped and produces no output
  • replace: a byte with no corresponding character is replaced by a placeholder (the U+FFFD replacement character when decoding, "?" when encoding)
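The three options can be compared on one invalid input. The sample bytes are made up; 0xff can never appear in valid UTF-8.

```python
raw = b"ab\xffcd"

# strict (the default): raises on the bad byte
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# ignore: the bad byte is dropped
ignored = raw.decode("utf-8", "ignore")
print(ignored)  # abcd

# replace: the bad byte becomes U+FFFD when decoding
replaced = raw.decode("utf-8", "replace")
print(replaced == "ab\ufffdcd")  # True

# replace when encoding uses "?" instead
encoded_q = "a中".encode("ascii", "replace")
print(encoded_q)  # b'a?'
```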

  Besides encoding and decoding, is there any other way to display byte data? For example, many serial port debugging assistants display the content with the 0x prefix removed: the byte 0xAF is shown as the string AF. How is this done? By using the hex method of the byte string.
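A small sketch of that display style, with made-up sample bytes. Note that bytes.hex() returns lowercase digits, so upper() is needed to get AF; the space-separated form is one possible way to mimic a serial assistant, not the only one.

```python
data = b"\xaf\x01"

# Hex digits without any 0x prefix
plain = data.hex()
print(plain)  # af01

# Upper case, as serial debugging tools usually show it
upper = data.hex().upper()
print(upper)  # AF01

# One two-digit group per byte, separated by spaces
spaced = " ".join(f"{b:02X}" for b in data)
print(spaced)  # AF 01
```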

Note that there is also a standalone built-in function hex. It converts an integer to hexadecimal and returns the result as a string with a 0x prefix.
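For comparison, a short sketch of the built-in hex and one way to drop its prefix (format is used here as an alternative, not something the original mentions):

```python
value = 0xAF

# Built-in hex(): int -> str, lowercase, with 0x prefix
with_prefix = hex(value)
print(with_prefix)  # 0xaf

# format() gives two upper-case digits without the prefix
no_prefix = format(value, "02X")
print(no_prefix)  # AF
```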


  Finally, there is one more way: use the repr function to turn the entire content of the byte string into visible text, including characters such as \x. This is probably used less often.
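A brief sketch of what repr returns for a byte string, with made-up sample bytes:

```python
data = b"\xafA\n"

# repr() returns an ordinary str containing the escaped form
shown = repr(data)
print(shown)  # b'\xafA\n'

# The backslashes are now real characters in the string
print(len(data))   # 3
print(len(shown))  # 10
```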


Origin blog.csdn.net/ZHOU_YONG915/article/details/130233561