Weak encoding: Is the language programs use to communicate safe?

Hello, I am Wang Haotian.

We now begin the major chapter on cryptographic failures, and our first topic is weak encoding.

To understand what encoding is, picture the shopping frenzy on Double Eleven.

We buy snacks, household goods, and anime figures through an e-commerce platform. Ordering is so satisfying that people all over the country keep buying and buying. Now the platform has a problem: how does it deliver such a variety of products to everyone? Building a separate transport route for each kind of product is impossible.

So express delivery emerged. Packing different kinds of goods into uniform cartons both protects the goods in transit and makes them easy to ship.

This is a typical encoding scenario. When data travels between a server and a client, we cannot be sure that the content contains only what the transport protocol supports. We therefore encode the data before sending it, so that it is normalized into a form the protocol can carry.

Note that encoding provides no confidentiality. The courier simply chooses not to look inside the package; if he wanted to know what is in it, finding out would not be difficult.

Encoding

Let's take a look at how Wikipedia defines encoding:

Encoding is the process of converting information from one form or format to another; decoding is the inverse of encoding.

For an elegant development engineer, or a "big hacker", mastering the characteristics of a variety of encodings is very important. In this lesson I will take you into the world of encoding.

Character Encoding

Character encoding maps each character in a character set to an object in a specified set, such as a number or a bit pattern, so that text can be stored on a computer or transmitted over a network. In the early days of computing, character sets such as ASCII were the standard form of character encoding. These sets were very limited, for example they only suited English text, so people developed many ways to extend them, and the range of encodings grew steadily:

  • Early standards: ASCII, EBCDIC
  • ISO/IEC 8859 series: ISO-8859-1 (Western European), ISO-8859-5 (Cyrillic), ISO-8859-6 (Arabic), ISO-8859-7 (Greek), ISO-8859-11 (Thai), ISO-8859-15 (Western European), etc.
  • DOS character set: CP437, CP737, CP850, etc.
  • Windows character set: Windows-1250, Windows-1251, Windows-1252, etc.
  • Chinese: GB2312, GBK, etc.
  • Unicode: Unicode, UTF-7, UTF-8, UTF-16, UTF-32, etc.

Each of these character sets was created for its own reasons and use cases, and we meet some of them in everyday work. Here we pick a few representative ones for a closer look:

ASCII

ASCII (American Standard Code for Information Interchange) is the most commonly used encoding for letters, digits, and common symbols. On macOS or Linux, the following command lists all ASCII characters:

> man ascii
ASCII(7)             BSD Miscellaneous Information Manual             ASCII(7)

NAME
     ascii -- octal, hexadecimal and decimal ASCII character sets

DESCRIPTION
     The octal set:

     000 nul  001 soh  002 stx  003 etx  004 eot  005 enq  006 ack  007 bel
     010 bs   011 ht   012 nl   013 vt   014 np   015 cr   016 so   017 si
     020 dle  021 dc1  022 dc2  023 dc3  024 dc4  025 nak  026 syn  027 etb
     030 can  031 em   032 sub  033 esc  034 fs   035 gs   036 rs   037 us
     040 sp   041  !   042  "   043  #   044  $   045  %   046  &   047  '
     050  (   051  )   052  *   053  +   054  ,   055  -   056  .   057  /
     060  0   061  1   062  2   063  3   064  4   065  5   066  6   067  7
...

An ASCII character is stored in 8 bits, and the first (most significant) bit is always 0. The remaining 7 bits give 2^7 = 128 characters, of which 00100000 through 01111110 (0x20 to 0x7E) are the printable characters.
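
The bit layout described above can be checked with a few lines of Python. This is only a minimal sketch; the helper name is_ascii_byte is introduced here for illustration and is not part of any library.

```python
# Minimal sketch: inspect the ASCII range using only built-ins.

def is_ascii_byte(value: int) -> bool:
    """True when the top bit of an 8-bit value is 0, i.e. the value is 0..127."""
    return value & 0b1000_0000 == 0

# All 2**7 = 128 code points, and the subset described above as printable.
all_codes = list(range(2 ** 7))
printable = [c for c in all_codes if 0b00100000 <= c <= 0b01111110]

print(len(all_codes))                      # number of ASCII code points
print(len(printable))                      # number of printable characters
print("".join(chr(c) for c in printable))  # space through tilde

# Encoding text as ASCII yields one byte per character, each with top bit 0.
data = "hello123".encode("ascii")
print([format(b, "08b") for b in data])
```

Every byte in the last line starts with 0, which is exactly the spare bit that the Chinese encodings discussed next put to use.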

GB 2312 & GBK

Chinese has a vast number of characters, and a mere 128 code points cannot possibly meet the need, so Chinese encodings were born. Since 8 bits are far from enough and compatibility with ASCII is required, the GB2312 encoding came into being. It has the following characteristics:

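1. Two 8-bit bytes are used to encode a character;
2. Characters numbered 0 to 127 use the standard ASCII encoding;
3. Two consecutive bytes greater than 127 represent one Chinese character; the first is called the high byte and the second the low byte.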

What we usually call full-width characters are the double-byte characters, and half-width characters are the single-byte ones. It later turned out that GB2312 still could not represent all Chinese characters, so the third rule above was relaxed and GBK was born; the K stands for "extension" (kuozhan in pinyin). The relaxed third rule reads:

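3. The low byte may take a value in the 0 to 127 range (in practice 0x40 to 0x7E); the high byte alone determines whether the pair is a Chinese character.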

GB2312 encoding example:

你好hello123
\xC4\xE3\xBA\xC3\x68\x65\x6C\x6C\x6F\x31\x32\x33
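
The bytes in the GB2312 example above can be reproduced with Python's built-in gb2312 and gbk codecs. The sketch below is only an illustration: hex_escape is a helper name introduced here, and the pair \xC4\x5C is chosen simply as one GBK sequence whose low byte is below 128.

```python
text = "你好hello123"

def hex_escape(data: bytes) -> str:
    """Render bytes as backslash-x hex pairs, the notation used in this article."""
    return "".join(f"\\x{b:02X}" for b in data)

gb2312_bytes = text.encode("gb2312")
gbk_bytes = text.encode("gbk")
print(hex_escape(gb2312_bytes))
print(gb2312_bytes == gbk_bytes)  # GBK keeps GB2312's bytes for shared characters

# Bytes above 127 belong to a double-byte Chinese character; the rest are ASCII.
for b in gb2312_bytes:
    kind = "part of a double-byte character" if b > 127 else "ASCII"
    print(f"0x{b:02X}: {kind}")

# GBK extension: the low byte may be below 128. 0x5C is also the ASCII backslash.
wide = b"\xC4\x5C".decode("gbk")
print(len(wide))  # the two bytes decode to a single character
```

The last two lines matter later in this lesson: under GBK, a high byte can absorb a following backslash byte into one character.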

GBK encoding example (identical to GB2312 for these characters):

你好hello123
\xC4\xE3\xBA\xC3\x68\x65\x6C\x6C\x6F\x31\x32\x33

Unicode & UTF-8

Neither ASCII nor any single national character set is enough for the writing systems of every country. To solve this, the Unicode Consortium, working in step with the ISO/IEC 10646 standard, proposed a universal solution that covers all of the world's scripts: Unicode. It originally represented every character with two bytes, a form known as UCS-2 that later evolved into UTF-16. Two bytes turned out to be insufficient, so the code space was extended: UTF-32 stores every character in four bytes, while UTF-16 uses pairs of two-byte units for characters beyond the original range. Recent versions of Unicode also include emoji, making our written language richer and more vivid.

But does storing every character as Unicode raise the storage cost? After all, an ASCII character takes only 1 byte and a GBK character at most 2. Representing everything in UTF-32 at least doubles the storage space. To solve this problem another encoding appeared, and it has become one of the most widely used in software development: UTF-8.

UTF-8 is a variable-length encoding. ASCII characters are still represented by 1 byte, while every other character takes 2 to 4 bytes, with the high bits of the leading byte indicating how many bytes the character occupies. This suits environments where the code is mostly English but carries Chinese comments.

Unicode (UTF-32, big-endian) encoding example:

你好hello123
\x00\x00\x4F\x60\x00\x00\x59\x7D\x00\x00\x00\x68\x00\x00\x00\x65\x00\x00\x00\x6C\x00\x00\x00\x6C\x00\x00\x00\x6F\x00\x00\x00\x31\x00\x00\x00\x32\x00\x00\x00\x33

UTF-8 encoding example:

你好hello123
\xE4\xBD\xA0\xE5\xA5\xBD\x68\x65\x6C\x6C\x6F\x31\x32\x33
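
To compare the storage cost of the Unicode encoding forms, the same string can be encoded with Python's built-in codecs. This is a small sketch for illustration; the emoji U+1F600 is an arbitrary sample chosen here, not something taken from the course.

```python
text = "你好hello123"

# Code points are the numbers Unicode assigns to each character.
print([f"U+{ord(ch):04X}" for ch in text])

# The same text in three encoding forms; "-be" picks big-endian without a BOM.
for codec in ("utf-32-be", "utf-16-be", "utf-8"):
    encoded = text.encode(codec)
    print(codec, len(encoded), encoded.hex(" ").upper())

# UTF-8 is variable length: ASCII takes 1 byte, these Chinese characters take 3,
# and a character outside the Basic Multilingual Plane (an emoji here) takes 4.
for ch in ("h", "你", "\U0001F600"):
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")))
```

For this mostly ASCII string, UTF-8 needs 14 bytes where UTF-32 needs 40, which is the saving the text above describes.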

Program encoding

URL encoding

URL encoding is also called percent-encoding because every encoded sequence starts with %, a fitting name. It is used mainly in Uniform Resource Locators (URLs) and applies to Uniform Resource Identifiers (URIs) in general. The characters allowed in a URI fall into two groups: reserved characters, which carry special meaning, such as ! * &; and unreserved characters, which carry no special meaning, such as A B C.

If a reserved character needs to appear in a URI as plain data rather than with its special meaning, it must be percent-encoded. Percent-encoding writes the character's ASCII value as two hexadecimal digits and puts the escape character % in front. A non-ASCII character is first converted to its UTF-8 byte sequence, and then each byte is written with % in front in the same way.

Example of percent encoding in UTF-8 format:

你好hello123
%E4%BD%A0%E5%A5%BDhello123
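
The example above can be reproduced with quote and unquote from Python's standard urllib.parse module. The sketch below is only illustrative; the sample string a&b=c is made up here to show a reserved character being escaped.

```python
from urllib.parse import quote, unquote

text = "你好hello123"

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters by default.
encoded = quote(text)
print(encoded)

# Reserved characters are escaped too when they are not listed in `safe`.
print(quote("a&b=c", safe=""))

# The encoding parameter selects another character encoding, here GBK.
print(quote(text, encoding="gbk"))

# Decoding needs no key; only the character encoding must match.
print(unquote(encoded))
```

Note that the GBK variant yields different percent sequences for the same text, so the receiver has to know which character encoding the sender used.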

Base64 encoding

Base64 represents binary data using 64 printable characters. Since 64 = 2^6, every 6 bits map to one printable character. Six bits are three quarters of a byte, so each output character carries three quarters of a byte; in other words, every 3 input bytes become 4 output characters. This makes the expansion rate easy to calculate: the encoded data is about 4/3 the size of the original. Base64 is commonly used to represent, transmit, and store binary data.

Think about these rules for a moment and you will notice something interesting: if the number of bytes to encode is not divisible by 3, the last group cannot be encoded directly. The complete Base64 rule is therefore to pad the end of the data with zeros until the length is divisible by 3, and then encode. The number of bytes that were added is marked with the same number of "=" characters at the end.

Base64 encoding example:

你好hello123
5L2g5aW9aGVsbG8xMjM=
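
The padding rule and the expansion rate can be observed with Python's standard base64 module. This is a minimal sketch; the short samples abc, ab, and a are chosen here only to show zero, one, and two padding characters.

```python
import base64

raw = "你好hello123".encode("utf-8")  # 14 bytes, not a multiple of 3
encoded = base64.b64encode(raw)
print(encoded.decode("ascii"))

# Every 3 input bytes become 4 output characters, so the output length is
# 4 * ceil(len(raw) / 3).
print(len(raw), len(encoded))

# The number of "=" characters reflects how many bytes were missing from the
# final 3-byte group.
for sample in (b"abc", b"ab", b"a"):
    print(sample, base64.b64encode(sample))

# Decoding needs nothing but the encoded text.
print(base64.b64decode(encoded).decode("utf-8"))
```

Because decoding requires no secret, Base64 output should never be treated as protected data.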

Encoding vs Encryption

Having discussed encoding, we now know some of its characteristics. Let us briefly compare it with the encryption we studied in the previous lesson and see what they share and where they differ.

  • Both encoding and encryption are reversible operations: the original data can be recovered by decoding the encoded data; we can also obtain the original data by decrypting the encrypted data.

  • Encoding requires only one input, while encryption requires two inputs: after selecting the encoding function, we only need to supply the data to be encoded; for an encryption function, in addition to the data to be encrypted, we also need to supply the encryption key.

  • The purpose of encoding is to facilitate data interaction, and the purpose of encryption is to protect data interaction: through encoding, data can be transferred between different protocol systems for the purpose of availability; through encryption, data can be safely transmitted for the purpose of confidentiality.
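
The difference in inputs can be made concrete with a short sketch. Base64 stands in for encoding, and a toy XOR function stands in for encryption. The names encode and toy_xor, the sample message, and the random key are all assumptions made for this illustration. The XOR routine is not a secure cipher and is shown only because it takes a key as its second input.

```python
import base64
import secrets

def encode(data: bytes) -> bytes:
    """Encoding: one input. Anyone who knows the scheme can reverse it."""
    return base64.b64encode(data)

def toy_xor(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for encryption: two inputs, data and key.

    NOT secure; it only illustrates the extra key input.
    XOR is its own inverse, so the same call also decrypts.
    """
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

message = b"order #42"
key = secrets.token_bytes(len(message))  # hypothetical random key

encoded = encode(message)
ciphertext = toy_xor(message, key)

print(base64.b64decode(encoded))  # recovered with no secret at all
print(toy_xor(ciphertext, key))   # recovered only by someone holding the key
```

Both operations are reversible, as the first bullet says, but only the second one depends on something an eavesdropper does not have.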

Encoding vs Escaping

Escaping is a concept that is easily confused with encoding because, unlike encryption, it shares encoding's two properties: it needs only one input and it is reversible. But escaping and encoding are used in different scenarios; that is, their purposes differ.

Unlike encodings, which serve the purpose of facilitating data interaction, escaping generally serves two purposes:

  1. To encode entities within a piece of text, such as device commands or special data that cannot be represented directly by printable characters;
  2. To act as a character reference, representing characters that cannot be typed or printed in the current context, such as a carriage return.

A character sequence that begins with an escape character is called an escape sequence. The escape character usually has no meaning on its own, so an escape sequence generally consists of two or more characters.

By looking at the purpose, we can easily tell encoding and escaping apart.
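
As a small illustration of an escape sequence, the sketch below uses the carriage return mentioned above. It relies on Python's string literal rules and its built-in unicode_escape codec; the sample text is made up for the example.

```python
# In source code, backslash + r is a two-character escape sequence that
# stands for a single carriage-return character.
cr = "\r"
print(len(cr), ord(cr))

# A raw string keeps the escape sequence as two literal characters.
literal = r"\r"
print(len(literal))

# The unicode_escape codec converts between the two forms.
escaped = "line1\rline2".encode("unicode_escape")
print(escaped)
restored = escaped.decode("unicode_escape")
print(restored == "line1\rline2")
```

The backslash here has no meaning of its own; it only signals that the next character should be interpreted specially.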

Hands-on cases

With the basics of encoding covered, let's look at several encoding-related security issues. These hands-on cases are hosted on MiTuan, where you can find them by searching for [Coding Vulnerabilities Collection].

Wide-byte injection

After starting the target machine, we see a page that accepts HTTP GET requests. The page describes the code logic behind the sample vulnerability: the program processes the str parameter of the user's GET request with the addslashes function and then splices it into an SQL statement. The SQL statement that is actually executed is also printed on the page, which makes it easy to debug the exploit.

Now let's try to exploit this potential SQL injection vulnerability.

The first step is to find the injection point. The page accepts only the str parameter, so the injection point must be there. We first try some conventional injection inputs and observe how the page handles them. Trying the two inputs 1 and 1', for example, shows that after addslashes processes the input the quote is escaped and cannot close the string in the SQL statement, so a direct injection fails.

Although the input 1' does not close the SQL statement, the statement it produces gives us a new idea:

select * from user where user='1\''

In this complete SQL statement, 1 and \ are adjacent characters. If we replace 1 with a special byte that combines with \ to form a new character under the database's encoding, the backslash disappears and we achieve an encoding bypass.

The second step is to put the idea into practice and find a special byte that combines with \ to form a new character.

With an encoding tool we can see that \ is encoded in GBK as \x5C. From what we learned above about how GBK encodes Chinese characters, we only need to choose a suitable high byte. Here I choose \xC4; the encoding tool shows that \xC4\x5C is a Chinese character, so the complete spliced content \xC4\x5C\x27 meets our requirement.

In other words, replacing 1 with %C4 realizes the encoding bypass idea from the first step.

The third step is simple: send %C4 followed by a quote as the parameter of the GET request. Note that the str parameter must be URL-encoded; to URL-encode a GBK byte, just put a % sign in front of its hexadecimal value. Joining %C4 with ' (URL-encoded as %27) gives the complete parameter %C4%27.

Enter the complete parameter we constructed into the browser address bar, and the page outputs:

select * from user where user='腬''

Next, you can append other SQL control characters to carry out the injection, for example # (URL-encoded as %23) to comment out the trailing quote:

str=%C4%27%23
select * from user where user='腬'#'
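
The whole chain can be simulated offline. The sketch below is not the target's actual code: addslashes here is a hypothetical byte-level imitation of the PHP function in Python, build_sql is a name invented for this example, and the database connection is assumed to interpret the bytes as GBK.

```python
from urllib.parse import unquote_to_bytes

def addslashes(data: bytes) -> bytes:
    """Hypothetical byte-level imitation of PHP's addslashes():
    prefix single quote, double quote, backslash and NUL with a backslash."""
    out = bytearray()
    for b in data:
        if b in b"'\"\\\x00":
            out.append(0x5C)  # backslash
        out.append(b)
    return bytes(out)

def build_sql(raw_param: str) -> str:
    """URL-decode the parameter, escape it byte by byte, then let the
    (assumed) GBK database connection interpret the resulting bytes."""
    escaped = addslashes(unquote_to_bytes(raw_param))
    return "select * from user where user='" + escaped.decode("gbk") + "'"

print(build_sql("1%27"))       # the quote stays escaped
print(build_sql("%C4%27"))     # 0xC4 swallows the backslash, the quote is free
print(build_sql("%C4%27%23"))  # the trailing # comments out the closing quote
```

The escaping function works on single bytes while the database reads double-byte characters, and that mismatch is what the attack exploits.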

CVE-2021-42574

This vulnerability was discovered by researchers at the University of Cambridge. It is caused by an encoding issue and is typically seen in supply chain poisoning attacks. Before explaining how it works, let's take a close look at it:

#include <stdio.h>
#include <stdbool.h>

int main() {
    bool isAdmin = false;
    /* begin admins only */ if (isAdmin) {
        printf("You are an admin.\n");
    /* end admins only */ }
    return 0;
}

The logic of this C code is very simple: it checks the boolean isAdmin and acts accordingly. Given the initial value of isAdmin, the function should go straight to the return statement without producing any output. Let's run it:

$> clang program.c && ./a.out
You are an admin.

Something magical happened: even though isAdmin is false, the program still executed the body of the if branch.

Clever reader, can you tell why?

The secret lies in control characters. Unicode bidirectional control characters can change the order in which text is displayed, so the code you see differs from the code the compiler reads. The sample above, with the control characters written out by name, actually looks like this:

#include <stdio.h>
#include <stdbool.h>

int main() {
    bool isAdmin = false;
    /*RLO } LRIif (isAdmin)PDI LRI begin admins only */
        printf("You are an admin.\n");
    /* end admins only RLO { LRI*/
    return 0;
}

As the real code shows, the if statement sits entirely inside a comment, so there is no actual check at all. Here RLO stands for right-to-left override, LRI for left-to-right isolate, and PDI for pop directional isolate.

So why does Unicode define such "deceptive" characters?

Unicode is not being malicious. Recall why Unicode was created: to be a universal encoding scheme covering all the world's text. Human cultures are rich. Taking written language as an example, some scripts, such as Chinese, are read and written from left to right, while others, such as Arabic, are read and written from right to left. To support text that mixes both, Unicode provides control characters that affect the display order.

Supply chain poisoning attacks have become common in recent years. Once attackers break into a software vendor's code base or poison a widely used open source project, hidden changes like this pose a huge security threat.
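
One simple way to guard a code base against this trick is to scan source files for bidirectional control characters before review. The sketch below is only an illustration: find_bidi_controls and the sample string are names and data invented here, modelled on the comment line from the example above.

```python
import unicodedata

# Bidirectional control characters defined by Unicode.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE RLE PDF LRO RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI RLI FSI PDI
}

def find_bidi_controls(source: str):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

# Hypothetical sample modelled on the comment line from the example above.
sample = (
    "bool isAdmin = false;\n"
    "/*\u202E } \u2066if (isAdmin)\u2069 \u2066 begin admins only */\n"
)

for hit in find_bidi_controls(sample):
    print(hit)
```

A check like this could run in a pre-commit hook or CI job, rejecting any source file that contains such characters outside of an explicit allow list.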

Summary

In this lesson we studied one form of security risk under cryptographic failures: weak encoding.

There are in fact many encoding-related security problems, and most of them stem from confusing encoding with encryption algorithms. Weak encoding is only the shallowest of these problems. Once you understand the essence of encoding, which is the conversion of information from one format to another, you can tell encoding from encryption and choose the right one for each scenario.

Starting from the shallow security issue of weak encoding, this lesson went on to examine several mainstream encoding standards so that we can quickly recognize which encoding a piece of data uses: ASCII occupies 1 byte (8 bits), describes 128 characters, and suits English text; GB2312 and GBK occupy 2 bytes (16 bits) per Chinese character and target Chinese text, with GBK being an extension of GB2312; Unicode and UTF-8 are more ambitious and describe the text of every country, with UTF-8 being variable length.

Building on character encoding, we then explored common program encodings. URL encoding is marked by a leading %, hence the name percent-encoding, and its output closely resembles the raw GBK or UTF-8 bytes. Base64 output consists entirely of printable characters and may end with = symbols; it is mainly used to transmit binary data. By extension, the other Base encodings work in a similar way.

Deeper encoding-related security issues involve encoding conversion and the handling of escape characters. For the hands-on section I therefore chose two vulnerabilities to explore in depth:

  1. The root cause of wide-byte injection is that data and commands are mixed together, but the direct trigger is a character-processing function that does not take the character encoding into account, so encoding conversion is handled loosely and the escaping is bypassed;
  2. For the Unicode character-ordering problem, we took CVE-2021-42574 as the example. Its root cause is that the IDE interprets control characters when rendering Unicode text, which leads developers to misread the code and lets backdoors or other security threats slip in.

This lesson shows that encoding may not look like a software development problem, yet the knowledge and principles behind it are extensive. The security problems it introduces are also hard to spot because their logic is obscure. During development it is therefore very important to understand what each encoding does and how data is encoded as it flows through the program. Because encoding-related security problems are well hidden, we can also bring a good SAST tool into the project to help find and locate them.

Questions to think about

Besides the two encoding vulnerabilities covered in this lesson, there is also a homoglyph vulnerability, CVE-2021-42694. Can you trace and analyze it on your own?

You are welcome to leave your thoughts in the comments. See you in the next lesson.

Article source: Geek Time " Web Vulnerability Digging Actual Combat "

Origin blog.csdn.net/m0_68101999/article/details/130386708