Tencent Kona JDK: practice and development in the big data field

The Cloud+ technology community salon "Tencent Open Source Technology" recently came to a close. The salon invited several Tencent technical experts to give an in-depth look at Tencent's open source projects TencentOS tiny, TubeMQ, Kona JDK, TARS and MedicalNet. In this article, Yang Xiaofeng of Tencent introduces Kona JDK, the open source JDK that Tencent develops in-house on the basis of OpenJDK.

I. The origin of Tencent Kona

1. OpenJDK

People often talk about OpenJDK, but what exactly is it? Most of us have heard of the Java SE, ME and EE specifications. In the usual sense, OpenJDK is a free and open source reference implementation of the Java SE specification.

It began in 2006, when Sun committed to gradually open-sourcing the core of the Java platform, including HotSpot, the compiler and the class libraries. The following year, Red Hat joined and released IcedTea, a version built entirely with free (GNU) software.

2010 brought major changes: Oracle took over stewardship from Sun, IBM joined OpenJDK and gave up on Apache Harmony, and Apple joined OpenJDK as well.

In 2014, JDK 8 was released. It is the most quickly and widely adopted version so far and remains the mainstay of production environments. Industry surveys this year show that for almost all vendors JDK 8 is still the main production version.

In 2017, after three years of development, Java 9 was released, and that year brought a series of dizzying changes.

First, at the technical level, the JDK gained a native module system through Project Jigsaw: JPMS (Java Platform Module System). JPMS removes some obstacles to the future rapid evolution of the Java language and the JDK, but it also introduces some compatibility problems, and JDK 9 did not replace JDK 8 as widely as had been expected.
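As a small illustration of what JPMS changes at run time: since JDK 9 every class belongs to a module, and this can be inspected through the standard `java.lang.Module` API. The sketch below is only an example; the class name `ModuleDemo` and the method `moduleOf` are made up for it.

```java
// Illustrative only: ModuleDemo and moduleOf are hypothetical names.
class ModuleDemo {
    // Returns the name of the module that defines the given class,
    // or "<unnamed>" for classes loaded from the class path.
    static String moduleOf(Class<?> type) {
        Module module = type.getModule();
        return module.isNamed() ? module.getName() : "<unnamed>";
    }

    public static void main(String[] args) {
        // Core classes live in the java.base module on JDK 9 and later.
        System.out.println(moduleOf(String.class));
        // Application classes started from the class path are in the unnamed module.
        System.out.println(moduleOf(ModuleDemo.class));
    }
}
```

Code that used to reach into JDK internals now runs into these module boundaries, which is one source of the compatibility problems mentioned above.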

Second, Java moved from a feature-driven release model to a time-driven one. The rapid development of fields such as cloud computing had posed a major challenge to Java.

Under the traditional model, a release was ideally expected about every two years, but the actual development of JDK 9 slipped by about a year, and even then some planned work was not finished.

The long development cycle meant that many technologies that were already complete could not reach production for a long time, which greatly slowed the pace of Java's development and iteration. To adapt, the JDK switched to a new release model with a six-month cadence.

Third, Oracle open-sourced the commercial features of Oracle JDK and changed its pricing strategy, greatly shortening the period of free support for the JDK.

The good news is that although Java went through a "paid Java" storm, activity and participation in the OpenJDK community have in fact greatly increased. Tencent, Microsoft and other vendors have joined the community and begun to contribute actively to OpenJDK.

2. Oracle JDK

What is the relationship between Oracle JDK and OpenJDK? As a simple comparison, Oracle JDK 8 can be regarded as a superset of OpenJDK that includes Oracle's own commercial features.

By JDK 11, however, the form of the JDK product had changed considerably. What used to be one large monolith was partly decoupled: JMC, OpenJFX and others became packages independent of the JDK, and Oracle's commercial features were open-sourced as well. As a result, Oracle JDK 11 and OpenJDK 11 differ only in their license.

3. Tencent Kona JDK

Tencent Kona JDK is a JDK distribution developed on the basis of the OpenJDK mainline, and it adheres to several principles.

First, it is free and can be used with confidence.

Second, we provide long-term, reliable support for the LTS versions JDK 8 and JDK 11.

Third, Kona JDK has been validated under Tencent's massive workloads to ensure that it is production-ready.

So what is the relationship between Tencent Kona and OpenJDK? What are its advantages, and what is the thinking behind its development?

First of all, we are committed to Kona being a "friendly fork" and to keeping Kona as compatible as possible. The Kona team communicates extensively with a large number of internal users and with partner JDK teams, and holds itself to a high standard of compatibility, stability, reliability and performance.

Kona JDK does not innovate for the sake of innovation. Technical investment that damages JDK compatibility is irresponsible, because it imposes migration and maintenance costs on users' future production systems.

In addition, Tencent runs Java/JVM workloads at massive scale in big data, cloud and other areas. We will share the practical experience and technology accumulated at this scale, and gradually make it available to everyone as bug fixes, enhancements, features or open source projects.

In terms of community engagement, we have communicated with the OpenJDK community and with Oracle's Java product teams and conveyed the above principles clearly to the whole community. We hold ourselves strictly to the community's governance standards and do not intend to be free riders.

II. OpenJDK technology trends

Coming back to OpenJDK itself, what are the main trends worth watching now?

The following draws on a summary by John Rose (Oracle JVM Architect).

1. Java-on-Java

Java-on-Java means implementing the JVM itself in the Java language.

For example, the current C1/C2 JIT (Just-In-Time) compilers are written mainly in C++, and their code has become very hard to improve. In GraalVM and other projects, Oracle is gradually putting a Java-based JVM into practice, and this technology will gradually become a core part of the future JVM.

Much of this is still experimental, but it has already shown very impressive results. For example, with native-image and SubstrateVM, startup time and memory footprint can improve dramatically; for a simple program, improvements of as much as 30x can be seen.

Because it relies on the closed-world assumption, it trades away some of Java's dynamic features, and it is still some distance from widespread use, but its future is promising.

2. Repaying technical debt in the design and implementation of the Java language and the JVM

In Java's design, everything other than the primitive types is an object. Object headers and support for polymorphism carry significant overhead, and for plain data this is often pure overhead rather than something the data itself needs.

At the same time, complex reference relationships complicate the memory layout, making it hard to take full advantage of the cache hierarchy of modern CPUs. The Java language also struggles to express some complex data structures and paradigms efficiently and elegantly. These issues are being addressed in Valhalla and other projects.
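To make the object-header overhead concrete, here is a back-of-the-envelope estimate comparing an `int[]` with an `Integer[]` holding the same values. All sizes in the sketch are assumptions chosen to resemble a typical 64-bit HotSpot JVM with compressed references; actual layouts depend on the JVM and its flags, and the class name `BoxingOverhead` is made up for this example.

```java
// Rough estimate only. Every size constant below is an assumption,
// not a value guaranteed by the Java specification.
class BoxingOverhead {
    static final long ARRAY_HEADER_BYTES = 16;  // assumed array header size
    static final long OBJECT_HEADER_BYTES = 12; // assumed object header size
    static final long REFERENCE_BYTES = 4;      // assumed compressed reference
    static final long INT_BYTES = 4;
    static final long ALIGNMENT = 8;            // assumed object alignment

    // Rounds a size up to the next multiple of the alignment.
    static long align(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // int[n]: one header plus n packed 4-byte values.
    static long primitiveArrayBytes(long n) {
        return align(ARRAY_HEADER_BYTES + n * INT_BYTES);
    }

    // Integer[n]: one header, n references, and n separate Integer objects
    // (assuming every element is a distinct object).
    static long boxedArrayBytes(long n) {
        long perInteger = align(OBJECT_HEADER_BYTES + INT_BYTES);
        return align(ARRAY_HEADER_BYTES + n * REFERENCE_BYTES) + n * perInteger;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("int[]     ~ " + primitiveArrayBytes(n) + " bytes");
        System.out.println("Integer[] ~ " + boxedArrayBytes(n) + " bytes");
    }
}
```

Under these assumptions the boxed array needs roughly five times the memory, and its elements are scattered across the heap instead of sitting next to each other in one cache-friendly block.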

3. Java syntax changes

Java syntax is evolving gradually through Amber, Valhalla and other projects to improve development efficiency and code quality. For example, Java 10 introduced local variable type inference, which simplifies code by inferring types from context and improves readability. Features that followed in the preview stage include switch expressions, which use an expression in place of a statement, considerably increasing the expressiveness of the language and encouraging efficient development and good practice.
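A short sketch of the two features just mentioned. It assumes JDK 14 or later, where switch expressions are a standard feature (they were previews in JDK 12 and 13); `var` needs JDK 10 or later. The class and enum names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: SyntaxDemo, Day and the method names are hypothetical.
class SyntaxDemo {
    enum Day { MON, TUE, WED, THU, FRI, SAT, SUN }

    // Switch expression: it yields a value, has no fall-through, and the
    // compiler checks that every enum constant is covered.
    static String kind(Day day) {
        return switch (day) {
            case SAT, SUN -> "weekend";
            case MON, TUE, WED, THU, FRI -> "weekday";
        };
    }

    // Local variable type inference: the types of result and day are
    // inferred from their initializers.
    static List<String> kinds() {
        var result = new ArrayList<String>();
        for (var day : Day.values()) {
            result.add(day + "=" + kind(day));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(kinds());
    }
}
```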

4. Building hardware-level capabilities

OpenJDK is gaining the ability to work with hardware faster and more directly, mainly through Panama and other projects under development. These provide compute support for big data and machine learning, for example better interoperability with native code and vector computing capabilities.

There is also Project Loom, which aims to improve both concurrent programming and runtime efficiency by introducing Fiber/Continuation to address the development and runtime efficiency problems of Java concurrency today. Together with the much-anticipated pauseless GC and other work, the JVM as a whole is becoming more intelligent and efficient.

III. Practice and development in the big data field

As you know, the mainstream big data technology stacks are either written in Java or run on the JVM. Java and the JVM offer an easy-to-use syntax, cross-platform capability, and a wide range of tools and libraries, which has made the JVM the uncrowned king of the big data field, with almost no comparable competitor for now.

But when we actually develop and operate very large clusters that process massive amounts of data, the JVM gradually shows some limitations.

For example, in the mainstream Hadoop stack, the heap size of the NM and similar nodes directly affects the size of the cluster and its data, and GC stability is closely tied to the SLA. The current JVM GCs are still far from perfect when it comes to large heaps and need further improvement.

Looking at the characteristics of big data workloads, classic GC algorithms are to some extent a poor fit. Current designs such as generational collection are based on the empirical observation that "most objects are small and short-lived", but in Spark SQL and other big data workloads we often see many long-lived large objects, and even large numbers of large-object allocations.

Large objects are costly to allocate and initialize. In addition, in region-based designs such as G1 GC, an object whose size reaches or exceeds 50% of a region occupies one or more whole regions, and the remainder of those regions is wasted, which limits how efficiently precious memory can be used.
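The waste described above can be worked out with simple arithmetic. The sketch below models only the rule as stated in the text (an object of at least half a region takes whole regions); the exact threshold and behavior in a given JDK version may differ, and the 4 MiB region size and the class name `HumongousWaste` are assumptions for illustration.

```java
// Simplified model of region-based allocation for large objects.
// It follows the rule stated in the text, not the exact G1 implementation.
class HumongousWaste {
    // An object of at least half a region is treated as a large ("humongous") object.
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes * 2 >= regionBytes;
    }

    // Number of whole regions needed to hold the object.
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    // Space left unused in the regions the object occupies.
    static long wastedBytes(long objectBytes, long regionBytes) {
        return regionsNeeded(objectBytes, regionBytes) * regionBytes - objectBytes;
    }

    public static void main(String[] args) {
        long mib = 1024 * 1024;
        long region = 4 * mib; // assumed region size for this example
        long[] sizes = { 1 * mib, 2 * mib, 4 * mib + 1, 9 * mib };
        for (long size : sizes) {
            if (isHumongous(size, region)) {
                System.out.println(size + " bytes: " + regionsNeeded(size, region)
                        + " region(s), " + wastedBytes(size, region) + " bytes unused");
            } else {
                System.out.println(size + " bytes: regular allocation");
            }
        }
    }
}
```

The worst case is an object just over a multiple of the region size: in this model, an object of 4 MiB plus one byte takes two regions and leaves almost a whole region unused.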

From the point of view of the business characteristics of big data, JVM mechanisms also need targeted revision and improvement. For example, a big data service may run offline computations on a schedule, so that application behavior changes sharply at certain times of day. It is not uncommon for the JVM's current adaptive mechanisms to cope poorly with this: the G1 GC prediction engine mispredicts repeatedly, causing long GC pauses that sometimes hurt the SLA. Targeted improvements are essential.

Because of all these limitations, many big data frameworks have had to go to great lengths to work off-heap, which affects both development and operations efficiency.

We have noticed in production that when a big data application enters a JNI critical section, the GC Locker is triggered, and frequent, futile young GCs combined with large-object allocation can lead to unexpected JVM OOM errors. This problem occurs more often in big data scenarios; see the following figure for details.

In addition, neither big data nor machine learning can escape one core concern: computing power.

Big data relies on data parallelism at three levels: machines (clusters), threads (multi-core) and instructions (SIMD). Since roughly 2002, CPU core frequency has seen no significant increase and has in some cases even fallen, so the scalability of production workloads depends more and more on adding CPUs and adding machines. Distributed clusters and multithreading are essential, but optimization at the instruction level has not yet received enough attention in the JVM, and exploiting instruction-level parallelism is one of the ways to secure computing power.

There are usually three ways to use vectorization/SIMD on the JVM:

First, call native code directly through JNI, which is very hard to develop and maintain, for reasons such as the variety of CPUs.

Second, develop your own JVM intrinsics, which is unrealistic for ordinary developers.

Third, use the auto-vectorization capabilities provided by the JVM, which is more feasible.

But auto-vectorization also has many limitations. It is currently available only through the SuperWord optimization in C2, which depends on loop unrolling of counted loops, so code that relies on it is difficult to write and fragile.
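To show what "depends on counted loops" means in practice, the sketch below contrasts a loop shape that auto-vectorization can in principle handle with one that it cannot handle in a straightforward way. Whether C2 actually vectorizes the first loop depends on the JDK version, JVM flags and hardware; nothing here forces or verifies it. The class and method names are made up for the example.

```java
import java.util.Arrays;

// Illustrative sketch; CountedLoopDemo and its methods are hypothetical names.
class CountedLoopDemo {
    // Counted loop: int index, stride 1, loop-invariant bound, and
    // independent iterations. This is the shape auto-vectorization looks for.
    static void add(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Each iteration reads the result of the previous one (a loop-carried
    // dependency), so iterations cannot simply run side by side in SIMD lanes.
    static void prefixSum(float[] a, float[] out) {
        if (a.length == 0) {
            return;
        }
        out[0] = a[0];
        for (int i = 1; i < a.length; i++) {
            out[i] = out[i - 1] + a[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        float[] sum = new float[4];
        float[] running = new float[4];
        add(a, b, sum);
        prefixSum(a, running);
        System.out.println(Arrays.toString(sum));
        System.out.println(Arrays.toString(running));
    }
}
```

Small, innocent-looking changes to the first loop, such as a data-dependent branch or a method call in the body, can silently turn vectorization off, which is why this approach is fragile.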

The Vector API currently incubating in OpenJDK can greatly improve both development efficiency and performance, and we will actively promote its development and maturation.

For diagnostics and tuning in big data scenarios, Kona has integrated Java Flight Recorder (open-sourced by Oracle) internally. It provides full-stack JVM profiling that is usable in production, and makes it possible to diagnose memory leaks without a heap dump, which is very helpful in big data scenarios with massive distributed clusters and frequent use of large heaps.

It also helps us better understand the common overheads of big data workloads in general, such as serialization/deserialization and memory (object) allocation.
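For readers who have not used Java Flight Recorder, here is a minimal sketch of recording a piece of work programmatically with the standard `jdk.jfr` API available in JDK 11 and later. It does not show Kona's internal integration; the class name `JfrDemo`, the choice of the `jdk.ObjectAllocationInNewTLAB` event and the toy workload are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordingFile;

// Illustrative only: JfrDemo and its methods are hypothetical names.
class JfrDemo {
    // Runs the workload under a flight recording and returns the .jfr file.
    static Path record(Runnable workload) {
        try {
            Path file = Files.createTempFile("demo", ".jfr");
            try (Recording recording = new Recording()) {
                // Allocation events are one way to look at memory behavior
                // without taking a heap dump.
                recording.enable("jdk.ObjectAllocationInNewTLAB");
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the recording back and counts the events it contains.
    static long countEvents(Path file) {
        try {
            return RecordingFile.readAllEvents(file).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = record(() -> {
            long total = 0;
            for (int i = 0; i < 100_000; i++) {
                total += new byte[1024].length; // allocate so there is something to record
            }
            System.out.println("allocated bytes: " + total);
        });
        System.out.println("recording written to " + file);
        System.out.println("events read back: " + countEvents(file));
    }
}
```

The resulting file can also be opened in JDK Mission Control (JMC) for interactive analysis.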

Together with the community, we will enhance jmap and other SVC tools, optimize these common overheads, and contribute our experience back to the community.

IV. Q&A

Q: Mr. Yang, what is the difference between OpenJDK and the Kona JDK that Tencent Cloud is working on now? What advantages does it offer?

A: Kona JDK is based on the OpenJDK mainline and is improved according to the needs and pain points of real-world scenarios such as massive microservices, serverless and big data. Its goal is to provide the best Java runtime environment and solutions for each of those scenarios.

Kona JDK will try to upstream its features as far as possible to maximize the benefit to the Java ecosystem. In selecting features it gives full consideration to compatibility, maturity and production value, with the emphasis on bringing tangible gains in efficiency and productivity. If an individual feature or bug fix does not fit the general standards or work plans of the upstream mainline but has significant value in production, Kona may still choose to provide it.

I think that, from the perspective of the health of the Java ecosystem, the point of Kona JDK is not to be different, but to provide long-term, reliable support, to speed up the delivery of new OpenJDK features into production, and to strengthen Java's foundations in data science, cloud computing and other fields.

About the speaker

Yang Xiaofeng is a senior technical expert at Tencent, a member of the system software technical committee of the China Computer Federation (CCF), currently in charge of the TEG JDK team, and an OpenJDK Committer. He previously led the core libraries group of Oracle's Java Platform team in Beijing and the intelligent systems R&D team of the JD.com (Jingdong) data platform, and produced the column "36 Lectures on Java Core Technology". He focuses on the evolution and practice of Java/JVM and other foundational software at the forefront of big data, cloud computing and related fields.


Origin juejin.im/post/5e6f2da16fb9a07cc97db787