Why is executing Java code in comments with certain Unicode characters allowed?

This article is translated from: Why is executing Java code in comments with certain Unicode characters allowed?

The following code produces the output "Hello World!" (No, really, try it.)

public static void main(String... args) {

   // The comment below is not a typo.
   // \u000d System.out.println("Hello World!");
}

The reason for this is that the Java compiler parses the Unicode escape \u000d as a new line, so the code gets transformed into:

public static void main(String... args) {

   // The comment below is not a typo.
   //
   System.out.println("Hello World!");
}

Thus resulting in a comment being "executed".
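The transformation described above can be imitated with a few lines of Java. The following is only a rough sketch of the escape translation step, not the compiler's actual implementation: the class name `UnicodeEscapeTranslator` and its methods are made up for this illustration, and malformed escapes (which a real compiler rejects) are simply left untouched here.

```java
class UnicodeEscapeTranslator {

    /** Replaces each eligible backslash-u escape with the character it denotes. */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0; // contiguous backslashes right before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // A backslash can start an escape only after an even number of backslashes.
            if (c == '\\' && precedingBackslashes % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // one or more 'u' characters are allowed
                }
                if (j > i + 1 && hasFourHexDigits(raw, j)) {
                    out.append((char) Integer.parseInt(raw.substring(j, j + 4), 16));
                    i = j + 4;
                    precedingBackslashes = 0;
                    continue;
                }
            }
            precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean hasFourHexDigits(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = from; k < from + 4; k++) {
            char ch = s.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The comment line from the question, with the escape kept as six literal characters.
        String raw = "// \\u000d System.out.println(\"Hello World!\");";
        // Only after translation does the text get split into lines.
        for (String line : translate(raw).split("\r")) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Running `main` prints two bracketed lines: the first holds only the `// ` comment marker, and the second holds the `System.out.println` call, which now sits outside the comment.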

Since this can be used to "hide" malicious code or whatever an evil programmer can conceive, why is it allowed in comments?

Why is this allowed by the Java specification?


#1st Floor

Reference: https://stackoom.com/question/24vd5/ (Why is executing Java code in comments with certain Unicode characters allowed?)


#2nd Floor

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3, this allows any ASCII-based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets), which has always been a key goal for the Java platform.

Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments, when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side effect.

There are many gotchas on this theme, and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program? If so, what does it print?

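\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0020\u0020\u0020\u0020\u0020\u0020\u0022\u0048\u0065\u006c
\u006c\u006f\u0020\u0077\u0022\u002b\u0022\u006f\u0072\u006c
\u0064\u0022\u0029\u003b\u007d\u007d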

(This program turns out to be a plain "Hello World" program.)

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can't be represented in any other way into your program. Avoid them in all other cases.


Source: Java: Executing code in comments?!


#3rd Floor

The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002f\u002f instead of // to begin a comment.
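As a small illustration of that last point, here is a sketch in which a line starts with the escape pair \u002f\u002f (U+002F is the code point of the slash character). The class and method names are invented for this example. Because the escapes are translated before tokenization, the compiler sees an ordinary // comment and the assignment on that line never happens.

```java
class EscapedCommentDemo {

    static String message() {
        String s = "visible";
        \u002f\u002f s = "never assigned, because this whole line is a comment";
        return s;
    }

    public static void main(String[] args) {
        System.out.println(message());
    }
}
```

The program compiles and prints `visible`, even though an editor that does not decode the escapes will most likely show the second line of `message()` as live code.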

This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.

This is also a design error in the language. It can't be corrected now, because that would break programs that depend on it. \u escapes should either be converted to the corresponding Unicode character by the compiler only in contexts where that "makes sense" (string literals and identifiers, and probably nowhere else), or they should have been forbidden to generate characters in the U+0000–007F range, or both. Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful. Note that this includes the use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does. (I am not aware of any editor or IDE that will display \u escapes as the corresponding characters in any context, though.)

There is a similar design error in the C family,[1] where backslash-newline is processed before comment boundaries are determined, so e.g.

// this is a comment \
   this is still in the comment!

I bring this up to illustrate that it happens to be easy to make this particular design error, and not realize that it's an error until it is too late to correct it, if you are used to thinking about tokenization and parsing the way compiler programmers think about tokenization and parsing. Basically, if you have already defined your formal grammar and then someone comes up with a syntactic special case (trigraphs, backslash-newline, encoding arbitrary Unicode characters in source files limited to ASCII, whatever) that needs to be wedged in, it's easier to add a transformation pass before the tokenizer than it is to redefine the tokenizer to pay attention to where it makes sense to use that special case.

[1] For pedants: I am aware that this aspect of C was 100% intentional, with the rationale (I am not making this up) that it would allow you to mechanically force-fit code with arbitrarily long lines onto punched cards. It was still an incorrect design decision.


#4th Floor

I agree with @zwol that this is a design mistake, but I'm even more critical of it.

The \u escape is useful in string and char literals, and that's the only place where it should exist. It should be handled the same way as other escapes like \n, and "\u000A" should mean exactly "\n".

There is absolutely no point in having \uxxxx in comments: nobody can read that.

Similarly, there's no point in using \uxxxx in other parts of the program. The only exception is probably public APIs that are coerced to contain some non-ASCII chars. When was the last time we saw that?

The designers had their reasons in 1995, but 20 years later, this appears to be the wrong choice.

(Question to readers: why does this question keep getting new votes? Is it linked from somewhere popular?)


#5th Floor

Since this hasn't been addressed yet, here is an explanation of why the translation of Unicode escapes happens before any other source code processing:

The idea behind it was that it allows lossless translations of Java source code between different character encodings. Today, there is widespread Unicode support, and this doesn't look like a problem, but back then it wasn't easy for a developer from a western country to receive some source code containing Asian characters from an Asian colleague, make some changes (including compiling and testing it) and send the result back, all without damaging something.

So, Java source code can be written in any encoding and allows a wide range of characters within identifiers, character and String literals, and comments. Then, in order to transfer it losslessly, all characters not supported by the target encoding are replaced by their Unicode escapes.

This is a reversible process, and the interesting point is that the translation can be done by a tool which does not need to know anything about Java source code syntax, because the translation rule does not depend on it. This works because the translation back to the actual Unicode characters inside the compiler happens independently of the Java source code syntax as well. It implies that you can perform an arbitrary number of translation steps in both directions without ever changing the meaning of the source code.
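A syntax-agnostic escaping tool of this kind can be sketched very briefly. The class `AsciiEscaper` below is hypothetical and assumes, for simplicity, that the input contains no escapes yet. Note that the loop treats identifiers, string literals and comments exactly alike, which is the point being made above.

```java
class AsciiEscaper {

    /** Replaces every char above 0x7F with a backslash-u escape; knows nothing about Java syntax. */
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c); // plain ASCII passes through unchanged
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Non-ASCII characters are built from code points so that this file itself stays ASCII.
        String eAcute = String.valueOf((char) 0x00E9);
        String name = String.valueOf((char) 0x540D);
        // One line of "source code" with non-ASCII text in an identifier, a string and a comment.
        String source = "String " + name + " = \"caf" + eAcute + "\"; // " + eAcute;
        System.out.println(toAscii(source));
    }
}
```

The printed line is `String \u540d = "caf\u00e9"; // \u00e9`, which is pure ASCII and still means the same thing to the compiler as the original line.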

The lossless round trip is also the reason for another odd feature that hasn't even been mentioned yet: the \uuuuuuxxxx syntax:

When a translation tool is escaping characters and encounters a sequence that is already an escape sequence, it should insert an additional u into the sequence, converting \ucafe to \uucafe. The meaning doesn't change, but when converting in the other direction, the tool should just remove one u and replace only sequences containing a single u by their Unicode characters. That way, even Unicode escapes are retained in their original form when converting back and forth. I guess no one ever used that feature...
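The extra-u convention can be sketched as a pair of functions. This is an illustrative simplification, not a real tool: the class `MultiUEscapes` and its method names are invented here, and the regular expression ignores the rule about how many backslashes precede an escape.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiUEscapes {

    // A backslash, one or more 'u' characters, then four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\(u+)([0-9a-fA-F]{4})");

    /** Marks existing escapes with an extra 'u', then escapes every non-ASCII char. */
    static String toAscii(String source) {
        String marked = ESCAPE.matcher(source).replaceAll("\\\\u$1$2");
        StringBuilder out = new StringBuilder();
        for (char c : marked.toCharArray()) {
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Decodes single-u escapes and strips one 'u' from multi-u escapes. */
    static String fromAscii(String ascii) {
        Matcher m = ESCAPE.matcher(ascii);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String us = m.group(1);
            String hex = m.group(2);
            String replacement = us.length() == 1
                    ? String.valueOf((char) Integer.parseInt(hex, 16))
                    : "\\" + us.substring(1) + hex;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String eAcute = String.valueOf((char) 0x00E9);
        // One escape that was already in the text, followed by one real non-ASCII character.
        String original = "\\ucafe and " + eAcute;
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original));
    }
}
```

The first printed line is `\uucafe and \u00e9`: the escape that was already present gained a u, while the real character became a single-u escape. The second line is `true`, showing that the reverse step restores both to their original form.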


#6th Floor

This was an intentional design choice that goes all the way back to the original design of Java.

To those folks who ask "who wants Unicode escapes in comments?", I presume they are folks whose native language uses the Latin character set. In other words, it is inherent in the original design of Java that folks could use arbitrary Unicode characters wherever legal in a Java program, most typically in comments and strings.

It is arguably a shortcoming in programs (like IDEs) used to view the source text that such programs cannot interpret the Unicode escapes and display the corresponding glyph.

Origin blog.csdn.net/w36680130/article/details/105241608