[OpenJDK][Translation] In Defense of Type Erasure

Recently I have become more interested in proposals: I have read a few JEPs and plan to read some PEPs next week.
I happened to come across the article in-defense-of-erasure on Zhihu, where Glavo wrote:

"…(Homogeneous translation) This approach has a powerful advantage that cannot be obtained any other way: gradual migration compatibility, that is, the ability to compatibly evolve a non-generic class into a generic one without breaking existing source code or binary class files…"

I didn't fully understand this, so I translated the article from the OpenJDK website. It is excellent and well worth a read. – Translator's note


Backstory: How We Got Generics Now (Or, How I Learned to Stop Worrying and Love Type Erasure)

Brian Goetz
June 2020

Before we talk about where generics are going, let's first talk about where they are, and how they got there. This document will focus primarily on how we got the generics we have now, and why, as a means of laying the groundwork for how our current generics will affect the "better" generics we're trying to build.
In particular, type erasure was a sensible and pragmatic choice when Java added generics in 2004, and the forces that pushed us toward translation by erasure largely still apply today.

Erasure
Ask any developer about Java generics, and you're likely to hear complaints about erasure; erasure is probably the most widely misunderstood concept in Java.
Erasure is not specific to Java, nor is it specific to generics. It is a ubiquitous, and often necessary, tool for translating code at one level into a lower level (such as compiling Java source to bytecode, or C source to native code). This is because, as we move down the stack from high-level language to intermediate representation to native code to hardware, the type abstractions offered by each lower level are almost always simpler and weaker than those offered above, and rightly so: we don't want to replicate the semantics of virtual dispatch in the x86 instruction set, or model Java's set of primitive types in registers. Erasure is the technique of mapping a richer type at a higher level onto a less rich type at a lower level (ideally after the higher-level types have been fully checked), and it is something compilers do every day.
For example, the Java bytecode instruction set includes instructions for moving integer values between the stack and the local variable set (iload, istore) and for performing arithmetic on ints (iadd, imul, etc.). There are analogous instructions for floats (fload, fstore, fmul, etc.), longs (lload, lstore, lmul), doubles (dload, dstore, dmul), and object references (aload, astore), but there are no such instructions for bytes, shorts, chars, or booleans, because the compiler erases those types to int and uses the int move and arithmetic instructions. This is a pragmatic design trade-off in the bytecode instruction set: it reduces the complexity of the instruction set, which in turn improves runtime efficiency. Many other features of the Java language (checked exceptions, method overloading, enums, definite-assignment analysis, nested classes, capture of local variables by lambdas and local classes, etc.) are likewise "language fictions": they are checked by the compiler but erased in the translation to class files.
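As a small illustration (a sketch added in this translation, not from the original article), the following method performs byte arithmetic at the source level, yet the compiled code uses the int instructions:

class ByteArithmetic {
    static byte sum(byte a, byte b) {
        // At the source level this is byte arithmetic, but javac compiles it
        // with the int instructions (iload, iadd) and narrows the result back
        // with i2b; there are no byte-specific load or add instructions.
        return (byte) (a + b);
    }
}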

Homogeneous and heterogeneous translation
There are two common approaches to translating generic types in languages with parametric polymorphism: homogeneous and heterogeneous translation. In homogeneous translation, a generic class Foo<T> is translated into a single artifact, such as Foo.class (and likewise for generic methods); an artifact here can be understood as one unit of compilation output (Translator's note). In heterogeneous translation, each instantiation of a generic type or method (Foo<String>, Foo<Integer>) is treated as a separate entity and generates a separate artifact. For example, C++ uses heterogeneous translation: different instantiations of a template are completely different types, with different semantics and different generated code; vector<int> and vector<float> are entirely distinct types. On the one hand, this is great for type safety (each instantiation can be separately type-checked after expansion) and for the quality of the generated code (each instantiation can be separately optimized). On the other hand, it means a larger code footprint (since vector<int> and vector<float> have separate code), and we cannot talk about "a vector of something" (the way Java does with wildcards), because each instantiation is a completely unrelated type.
As an extreme demonstration of the potential space cost, Scala has a @specialized annotation which, when applied to a type variable, causes the compiler to emit specialized versions for all primitive types. This sounds cool, but it results in a 10^n explosion of generated classes, where n is the number of specialized type variables in the class, so a 100MB JAR can easily be generated from just a few lines of code.
The choice between homogeneous and heterogeneous translation involves the sort of trade-offs language designers have always had to make. Heterogeneous translation offers more type specificity, at the cost of a larger static and dynamic footprint and less runtime sharing, all of which have performance implications. Homogeneous translation makes it easier to abstract over families of parameterizations, as Java's wildcards or C#'s declaration-site variance do (both of which are missing from C++, where there is no common ground between vector<int> and vector<float>).
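To make the wildcard point concrete, here is a small sketch (added in this translation, not from the original article): with homogeneous translation, a single method can operate on "a List of anything", something C++ templates cannot express because each instantiation is an unrelated type.

import java.util.List;

class Wildcards {
    // One method, one erased implementation, usable with List<String>,
    // List<Integer>, or any other instantiation of List.
    static int sizeOf(List<?> list) {
        return list.size();
    }
}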

Java Type Erasure
Java uses homogeneous translation for generics. Generics are type-checked at compile time, but when bytecode is generated, generic types such as List<String> are erased to List, and type variables such as <T extends Object> are erased to their bound (in this case, Object).
For example:

class Box<T> {
    private T t;

    public Box(T t) {
        this.t = t;
    }

    public Box<T> copy() {
        return new Box<>(t);
    }

    public T t() {
        return t;
    }
}

The javac compiler produces a single class file, Box.class, which serves as the implementation for all instantiations of Box, including wildcards (Box<?>) and raw types (Box). Field, method, and superclass descriptors are all erased: type variables are erased to their bounds, and generic types are erased to their heads (for example, List<String> erases to List), as follows:

class Box {
    private Object t;

    public Box(Object t) {
        this.t = t;
    }

    public Box copy() {
        return new Box(t);
    }

    public Object t() {
        return t;
    }
}

The generic signature is preserved in the Signature attribute, so that compilers reading the class file can still see it, but the JVM uses only the erased descriptors during linkage. This translation scheme means that, at the class file level, both the layout and the API of Box<T> are erased. At use sites, references to Box<String> are erased to Box, and a synthetic cast to String is inserted wherever a value of type T is consumed.
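For example (an illustrative sketch added in this translation), a use site of the generic Box above, and what the erased translation effectively looks like:

// What the programmer writes:
Box<String> bs = new Box<>("hi!");
String s = bs.t();

// What effectively reaches the class file after erasure:
Box bsErased = new Box("hi!");
String sErased = (String) bsErased.t();   // synthetic cast inserted by the compiler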

Why? What were the alternatives?
It is at this point that people tend to get angry and declare these to have been obviously stupid or lazy choices, or that erasure is a dirty hack. After all, why would the compiler throw away perfectly good type information?

To understand the question better, we should also ask: if we were to reify this type information, what would we want to do with it, and what costs would that entail? We can imagine several different ways of using reified type parameter information:

  • Reflection. For some, "reified generics" simply means being able to ask a List what its element type is, whether through language features such as instanceof or pattern matching, or through a reflection library that can query type parameters.

  • Layout or API specialization. In languages with primitive types or inline classes, the layout of Pair<int, int> could be flattened to hold two ints directly, rather than two references to boxed objects.

  • Runtime type checking. When a client tries to put an Integer into a List<String> (for example, via a raw List reference), which would cause heap pollution, it would be better to catch the error and fail at the point where the pollution is caused, rather than at the point where a synthetic cast later detects it.

While these three possibilities are not mutually exclusive, they (reflection, specialization, and type checking) serve different goals (programmer convenience, performance, and safety) and have different implications and costs. While it's easy to say "we want reification", if we dig deeper we find major disagreements about which of these matter most, and about their relative costs and benefits.

To understand how sensible and pragmatic erasure was, we must also understand the goals, priorities, constraints, and alternatives of the time.

Goal: Gradual migration compatibility
Java generics adopted an ambitious requirement:

It must be possible to evolve existing non-generic classes to be generic in a binary-compatible and source-compatible manner.

This means that existing clients and subclasses of, say, ArrayList can continue to compile unchanged against the generified ArrayList<T>, and existing class files continue to link against the methods of the generified ArrayList<T>. Supporting this means that clients and subclasses of a generified class can choose to generify now, later, or never, independently of what the maintainers of other clients or subclasses choose to do.

Without this requirement, generifying a class would require a "flag day" on which all of its clients and subclasses would have to be recompiled at once, if not modified outright. For a core class like ArrayList, that would essentially require all the Java code in the world to be recompiled immediately (or remain stranded on Java 1.4 forever). And it would not even be a single flag day but many, since all the code in the world would not be generified in one atomic transaction. What we needed was a generic type system that allowed core platform classes (and popular third-party libraries) to be generified without requiring clients even to be aware of their generification.
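As a sketch of what this buys us (added in this translation, not part of the original article), client code written against the pre-generics collections keeps compiling and linking against the generified ArrayList<T>; the only cost is an unchecked warning rather than an error:

import java.util.ArrayList;
import java.util.List;

class LegacyClient {
    static String firstName() {
        List names = new ArrayList();   // raw types: compiles with an unchecked warning
        names.add("Alice");             // still links against the generified ArrayList<T>
        return (String) names.get(0);   // the manual cast pre-generics code always needed
    }
}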

Another way to state this requirement: it was unacceptable to orphan all the code written before generics, or to force developers to choose between generics and the investment they had already made in existing code. By making generification a compatible operation, that investment in existing code is preserved rather than invalidated.

The aversion to "flag days" stems from an important aspect of Java's design: Java is separately compiled and dynamically linked. Separate compilation means that each source file is compiled into one or more class files, rather than compiling a group of sources into a single artifact. Dynamic linking means that references between classes are linked at runtime based on symbolic information: if class C calls a method void m(int x) in class D, then C's class file records the name and descriptor ((I)V) of the method being called; at link time, we look in D for a method with that name and descriptor and, if a match is found, the call site is linked.
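A minimal sketch of what that symbolic reference looks like (illustrative, added in this translation):

class D {
    void m(int x) { }
}

class C {
    void call(D d) {
        // C.class does not embed D's code; it records a symbolic reference to a
        // method named "m" in class D with descriptor (I)V, which the JVM
        // resolves at link time against whatever version of D is on the classpath.
        d.m(42);
    }
}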

This might sound like a lot of work, but separate compilation and dynamic linking are among Java's greatest strengths: you can compile C against one version of D and run it with a different version of D on the classpath (as long as you make no binary-incompatible changes).

The pervasive commitment to dynamic linking is what lets us simply drop a new JAR on the classpath to update to a new version of a dependency, without recompiling anything. We do this so routinely that we barely notice it, but if it stopped working, we certainly would.

When generics were introduced into Java, there was already a great deal of Java code in the world, and its class files were full of references to APIs such as java.util.ArrayList. If those APIs could not be generified compatibly, we would have had to write new APIs to replace them, and worse, all the client code of the old APIs would have been stuck with an unpalatable choice: stay on 1.4 forever, or rewrite against the new APIs all at once (and not just the application code, but every third-party library the application depends on). That would have rendered nearly all the Java code in existence at the time worthless.

C# made the opposite choice: it updated its VM and invalidated its existing libraries and all user code that depended on them. It could do so because there was comparatively little C# code in the world at the time; Java did not have that option.

One consequence of this choice, however, is that it is an expected situation for a generified class to have both generified and non-generified clients or subclasses. That is enormously helpful to the software development process, but it has consequences for type safety under such mixed use.

Heap pollution
Translating by erasure, and allowing interoperation between generified and non-generified clients, creates the possibility of heap pollution: the runtime type of what is stored in the box may be incompatible with the compile-time expectation. Whenever a client uses Box<String>, a cast to String is inserted at each point where a value crosses from the world of type variables (the implementation of Box) into the world of concrete types; it is these casts that detect heap pollution, and they can fail when it is present.

Heap pollution can come from non-generic code using generic classes, or from using unchecked casts or raw types to lie to the compiler about a reference's generic type. (When we use unchecked casts or raw types, the compiler warns us that heap pollution may result.)
For example:

Box<String> bs = new Box<>("hi!");   // safe
Box<?> bq = bs;                      // safe, via subtyping
Box<Integer> bi = (Box<Integer>) bq; // unchecked cast -- warning issued
Integer i = bi.t();                  // ClassCastException in the synthetic cast to Integer

The sin in this code is the unchecked cast from Box<?> to Box<Integer>; the developer is asserting that the box in question really is a Box<Integer>, and the compiler has to trust that assertion. But the heap pollution is not caught right away; only when we try to use the String in the box as an Integer do we detect that something has gone wrong. Under our translation, if we cast the box to Box<Integer> but then go back to using it as a Box<String> before any damage is done, nothing bad happens (for better or worse).

Java actually provides a fairly strong safety guarantee for generics, as long as we follow one rule:

Synthetic casts inserted by the compiler will never fail if the program is compiled without unchecked or raw warnings.

In other words, heap pollution can only happen when we interoperate with non-generic code or lie to the compiler. When heap pollution is detected, we get a simple and clear exception telling us the expected type and the actual type.

Context: JVM implementations and the language ecosystem

The design choices around generics were also influenced by JVM implementations and by the structure of the ecosystem of languages that run on the JVM. While "Java" appears to most developers as a monolithic entity, the Java language and the Java Virtual Machine (JVM) are in fact separate entities, each with its own specification. The Java compiler produces class files for the JVM (their format and semantics are specified in the Java Virtual Machine Specification), but the JVM will happily run any valid class file, regardless of the source language it originally came from. By some counts, more than 200 languages use the JVM as a compilation target, some of which have a lot in common with the Java language (such as Scala and Kotlin) and others that are very different (such as JRuby, Jython, and Jaskell).

One reason the JVM has been so successful as a compilation target, even for languages quite different from Java, is that it provides a fairly abstract computational model that is only loosely tied to the Java language. That abstraction layer between the language and the virtual machine has helped foster not only an ecosystem of other languages running on the JVM, but also an ecosystem of independent JVM implementations. While today's market has consolidated substantially, at the time generics were added to Java there were more than a dozen commercially viable JVM implementations. Reifying generics would have meant enhancing not only the language to support generics, but the JVM as well.

While it would have been technically possible at the time to add generics support to the JVM, this would not only have been a significant engineering investment requiring coordination and agreement among many implementors; the language ecosystem on the JVM would likely also have had opinions about reified generics. For example, would Scala (with its declaration-site variance, covariance and contravariance) have been willing to let the JVM enforce Java's (invariant) generic subtyping rules, if reification were interpreted to include runtime type checking?

Erasure is the most pragmatic compromise
Taken together, these constraints (both technical and ecosystem) acted as a strong push toward a homogeneous translation strategy that erases generic type information at compile time. To summarize, the forces that drove this decision include:

  • Runtime cost. Heterogeneous translation entails a variety of runtime costs: a larger static and dynamic footprint, higher class-loading cost, higher JIT cost and code-cache pressure, and so on. This could force developers to choose between type safety and performance.

  • Migration compatibility. At the time, no known translation scheme supported reified generics while remaining source- and binary-compatible, which would have meant a flag day and invalidated developers' significant investment in existing code.

  • Runtime cost. If reification were interpreted as checking types at runtime (much as stores into Java's covariant arrays are checked dynamically), this would have a significant runtime impact, because the JVM would have to perform a generic subtyping check, using the language's generic type system, on every field or array element store. This sounds easy and cheap when the type is something simple like List<String>, but it quickly gets expensive when the types look like Map<? extends List<? super Foo>, ? super Set<? extends Bar>>. (Indeed, later research called into question the decidability of generic subtyping.)

  • The JVM ecosystem. Getting a dozen JVM vendors to agree on whether, and how, to reify type information at runtime was a highly questionable proposition.

  • Delivery pragmatics. Even if it had been possible to get a dozen JVM vendors to agree on a workable scheme, it would have greatly increased the complexity, timeline, and risk of an already sizable and risky effort.

  • Language ecosystem. Languages like Scala might not have been willing to have Java's (invariant) generics burned into the JVM's semantics. Agreeing on an acceptable set of cross-language generics semantics for the JVM would, again, have added complexity, time, and risk to an already sizable and risky effort.

  • Users would have to deal with erasure (and heap pollution) anyway. Even if type information could be preserved at runtime, there would always be class files compiled before a class was generified, so any given ArrayList on the heap might or might not carry additional type information, with the attendant risk of heap pollution. (I don't quite understand this point – Translator's note)

  • It would foreclose on certain useful idioms. Existing generic code occasionally resorts to unchecked casts when it knows something about runtime types that the compiler does not and has no easy way to express in the generic type system; many of these techniques would be impossible with reified generics, which means they would have to be expressed differently, and often far more expensively (see the sketch after this list).
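A classic example of such an idiom (a sketch added in this translation, not from the original article) is a collection-like class that stores its elements in an Object[] and uses an unchecked cast the author knows is safe by construction:

class SimpleStack<T> {
    private final Object[] elements = new Object[16];
    private int size;

    void push(T t) {
        elements[size++] = t;
    }

    @SuppressWarnings("unchecked")   // safe by construction: only T values are ever stored
    T pop() {
        return (T) elements[--size]; // an unchecked cast the compiler cannot verify
    }
}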

Obviously, the costs and risks would have been enormous; what about the benefits? Earlier, we enumerated three possible benefits of reification: reflection, layout specialization, and runtime type checking. The arguments above largely rule out runtime type checking (on grounds of runtime cost, the risk of undecidability, ecosystem risk, and the continued existence of erased instances).

Of course, it would be nice to be able to ask a List what its element type is (and it might be able to answer, or it might not); there is clearly some benefit there. It is just that the costs and the benefits differ by orders of magnitude. (Another cost of the chosen generics strategy is that primitives cannot be used as type parameters; we must write List<Integer> rather than List<int>.)

The common misconception that erasure is "a dirty hack" often stems from a lack of awareness of the true costs of the alternatives, in engineering effort, time to market, delivery risk, performance, and ecosystem impact, and of the need to account for the vast amount of Java code that had already been written and the diverse ecosystem of JVM implementations and languages that run on the JVM.

Origin blog.csdn.net/treblez/article/details/127468821