Why is executing Java code in comments with certain Unicode characters allowed?

Translated from the original English question of the same title.

The following code produces the output "Hello World!" (no really, try it).

public static void main(String... args) {

   // The comment below is not a typo.
   // \u000d System.out.println("Hello World!");
}

The reason for this is that the Java compiler parses the Unicode escape \u000d as a new line, so the code gets transformed into:

public static void main(String... args) {

   // The comment below is not a typo.
   //
   System.out.println("Hello World!");
}

Thus resulting in a comment being "executed".

Since this can be used to "hide" malicious code or whatever an evil programmer can conceive, why is it allowed in comments?

Why is this allowed by the Java specification?

Reply #1

Reference: https://stackoom.com/question/24vd5/为什么在允许某些Unicode字符的注释中执行Java代码

Reply #2

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3, this allows any ASCII-based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets), which has always been a key goal for the Java platform.
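To make the ordering concrete, here is a minimal sketch of such a pre-tokenization pass. It is not javac's implementation: the class name `UnicodePreprocessor` and the method `translate` are hypothetical, and the sketch leaves malformed escapes untouched where a real compiler would report an error. It follows the rule that a backslash starts an escape only when it is preceded by an even number of contiguous backslashes, and it knows nothing about comments or string literals.

```java
class UnicodePreprocessor {

    /** Replaces Unicode escapes with the characters they denote, ignoring all other syntax. */
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0) {
                // Eligible backslash: expect one or more 'u', then exactly four hex digits.
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && j + 4 <= source.length() && isHex(source, j)) {
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps this literal from being translated by javac itself.
        String line = "// \\u000d System.out.println(\"Hello World!\");";
        // After translation the comment marker and the statement sit on different lines.
        System.out.println(translate(line));
    }
}
```

Because the pass runs before comments are recognized, the carriage return it produces ends the `//` comment, and the tokenizer then sees the `println` call as ordinary code.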

Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments, when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side effect.

There are many gotchas on this theme, and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program? If so, what does it print?
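\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0022\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u0022\u002b
\u0022\u006f\u0072\u006c\u0064\u0022\u0029\u003b\u007d\u007d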

(This program turns out to be a plain "Hello World" program.)

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can't be represented in any other way into your program. Avoid them in all other cases.

Source: Java: Executing code in comments?!

Reply #3

The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002f\u002f instead of // to begin a comment.

This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.

This is also a design error in the language. It can't be corrected now, because that would break programs that depend on it. \u escapes should either have been converted to the corresponding Unicode character by the compiler only in contexts where that "makes sense" (string literals and identifiers, and probably nowhere else), or they should have been forbidden to generate characters in the U+0000–007F range, or both. Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful. Note that this includes the use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does. (I am not aware of any editor or IDE that will display \u escapes as the corresponding characters in any context, though.)
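Since the language itself cannot be changed, the second rule (no escapes that decode into the U+0000–007F range) could at least be approximated by an external check. The following is only a sketch of such a lint pass, not an existing tool: the class `AsciiEscapeLint`, the method `findAsciiEscapes`, and the message format are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class AsciiEscapeLint {

    /** Returns one message per Unicode escape that decodes into U+0000..U+007F. */
    static List<String> findAsciiEscapes(String source) {
        List<String> findings = new ArrayList<>();
        int backslashRun = 0; // contiguous backslashes immediately before position i
        int line = 1;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\n') {
                line++;
            }
            if (c == '\\' && backslashRun % 2 == 0) {
                // Eligible backslash: expect one or more 'u', then four hex digits.
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && j + 4 <= source.length() && isHex(source, j)) {
                    int codeUnit = Integer.parseInt(source.substring(j, j + 4), 16);
                    if (codeUnit <= 0x7f) {
                        findings.add("line " + line + ": escape decodes to U+"
                                + String.format("%04X", codeUnit));
                    }
                    i = j + 3; // the loop increment then moves past the last hex digit
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
        }
        return findings;
    }

    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Doubled backslashes keep javac from translating the escapes in these literals.
        String source = "int x = 1; // \\u000d x = 2;\nString s = \"\\u00e9\";";
        findAsciiEscapes(source).forEach(System.out::println);
    }
}
```

Like the compiler, the check deliberately ignores whether an escape sits in a comment, a literal, or code. It flags the escape hidden in the comment and leaves the non-ASCII escape in the string literal alone.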

There is a similar design error in the C family,¹ where backslash-newline is processed before comment boundaries are determined, so e.g.

// this is a comment \
   this is still in the comment!

I bring this up to illustrate that it happens to be easy to make this particular design error, and not realize that it's an error until it is too late to correct it, if you are used to thinking about tokenization and parsing the way compiler programmers think about tokenization and parsing. Basically, if you have already defined your formal grammar and then someone comes up with a syntactic special case that needs to be wedged in (trigraphs, backslash-newline, encoding arbitrary Unicode characters in source files limited to ASCII, whatever), it's easier to add a transformation pass before the tokenizer than it is to redefine the tokenizer to pay attention to where it makes sense to use that special case.

¹ For pedants: I am aware that this aspect of C was 100% intentional, with the rationale (I am not making this up) that it would allow you to mechanically force-fit code with arbitrarily long lines onto punched cards. It was still an incorrect design decision.

Reply #4

I agree with @zwol that this is a design mistake, but I'm even more critical of it.

The \u escape is useful in string and char literals, and that's the only place where it should exist. It should be handled the same way as other escapes like \n, and "\u000A" should mean exactly "\n".

There is absolutely no point in having \uxxxx in comments: nobody can read that.

Similarly, there's no point in using \uxxxx in other parts of the program. The only exception is probably in public APIs that are coerced to contain some non-ASCII chars. When was the last time we saw that?

The designers had their reasons in 1995, but 20 years later, this appears to be a wrong choice.

(question to readers - why does this question keep getting new votes? is this question linked from somewhere popular?)

Reply #5

Since this hasn't been addressed yet, here is an explanation of why the translation of Unicode escapes happens before any other source code processing:

The idea behind it was that it allows lossless translations of Java source code between different character encodings. Today, there is widespread Unicode support, and this doesn't look like a problem, but back then it wasn't easy for a developer from a western country to receive some source code from his Asian colleague containing Asian characters, make some changes (including compiling and testing it) and send the result back, all without damaging something.

So, Java source code can be written in any encoding and allows a wide range of characters within identifiers, character and String literals and comments. Then, in order to transfer it losslessly, all characters not supported by the target encoding are replaced by their Unicode escapes.

This is a reversible process, and the interesting point is that the translation can be done by a tool which doesn't need to know anything about the Java source code syntax, as the translation rule is not dependent on it. This works because the translation to their actual Unicode characters inside the compiler happens independently of the Java source code syntax as well. It implies that you can perform an arbitrary number of translation steps in both directions without ever changing the meaning of the source code.

This is the reason for another weird feature which hasn't even been mentioned: the \uuuuuuxxxx syntax:

When a translation tool is escaping characters and encounters a sequence that is already an escape sequence, it should insert an additional u into the sequence, converting \ucafe to \uucafe. The meaning doesn't change, but when converting in the other direction, the tool should just remove one u and replace only sequences containing a single u with their Unicode characters. That way, even Unicode escapes are retained in their original form when converting back and forth. I guess no one ever used that feature…
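The round trip described here can be sketched in a few lines. This is an illustration only, under simplifying assumptions: the class `EscapeRoundTrip` and its methods `toAscii` and `fromAscii` are hypothetical names, "not supported by the target encoding" is reduced to "not ASCII", and edge cases such as a non-ASCII character directly following a backslash are ignored.

```java
class EscapeRoundTrip {

    /** Encodes to pure ASCII: non-ASCII chars become escapes, existing escapes gain one 'u'. */
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous backslashes immediately before position i
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                out.append(c);
                // An eligible backslash followed by 'u' starts an existing escape: add a 'u'.
                if (backslashRun % 2 == 0 && i + 1 < source.length()
                        && source.charAt(i + 1) == 'u') {
                    out.append('u');
                }
                backslashRun++;
            } else {
                backslashRun = 0;
                if (c > 0x7f) {
                    out.append(String.format("\\u%04x", (int) c));
                } else {
                    out.append(c);
                }
            }
        }
        return out.toString();
    }

    /** Decodes: single-'u' escapes become characters, multi-'u' escapes lose one 'u'. */
    static String fromAscii(String ascii) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0;
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < ascii.length() && ascii.charAt(j) == 'u') {
                    j++;
                }
                int uCount = j - i - 1;
                if (uCount == 1 && j + 4 <= ascii.length() && isHex(ascii, j)) {
                    out.append((char) Integer.parseInt(ascii.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
                if (uCount > 1) {
                    // Keep the escape, minus one 'u'; the hex digits are copied as plain text.
                    out.append('\\');
                    for (int k = 1; k < uCount; k++) {
                        out.append('u');
                    }
                    i = j;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One real non-ASCII character (U+00E9) plus one escape that is already written out.
        String original = "caf" + (char) 0xE9 + " \\ucafe";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original));
    }
}
```

Neither direction looks at comments, literals, or identifiers, which is the syntax independence the answer describes. The extra `u` is what lets `fromAscii` tell an escape the author wrote from one the tool introduced.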

Reply #6

This was an intentional design choice that goes all the way back to the original design of Java.

To those folks who ask "who wants Unicode escapes in comments?", I presume they are folks whose native language uses the Latin character set. In other words, it is inherent in the original design of Java that folks could use arbitrary Unicode characters wherever legal in a Java program, most typically in comments and strings.

It is arguably a shortcoming in programs (like IDEs) used to view the source text that such programs cannot interpret the Unicode escapes and display the corresponding glyph.
