The Definitive Guide to Java Performance - Summary 19

Java EE performance tuning

JVM thread tuning

Certain tuning strategies of the JVM can affect threading and synchronization performance.

Adjust thread stack size

When space is at a premium, the memory used by threads can be tuned. Each thread has a native stack, which the operating system uses to store the thread's call stack information (for example, if the main() method calls calculate(), and calculate() in turn calls add(), the stack records that chain of calls).

Different JVM versions have different default thread stack sizes, as shown in the following table. In general, many applications can actually run with a 128 KB stack on a 32-bit JVM and a 256 KB stack on a 64-bit JVM. The potential downside of setting this value too small is that a thread with a very deep call stack will throw a StackOverflowError.
Default stack size for several JVMs:
On 64-bit JVMs, there is no reason to set this value unless physical memory is very limited and a smaller stack prevents running out of native memory. On the other hand, on a 32-bit JVM, it is often a good choice to use a smaller stack (such as 128 KB), because it can release some memory in the process space, so that the JVM heap can be larger.

Running out of native memory

An OutOfMemoryError can also be thrown when there is not enough native memory to create a thread. This means one of three things has likely happened:
1. On a 32-bit JVM, the process has reached its 4 GB maximum size (or less than 4 GB, depending on the operating system).
2. The system has actually run out of virtual memory.
3. On Unix-style systems, the user has already hit the quota on the number of processes they may create; each individual thread counts as a process here.

Reducing the stack size can work around the first two problems, but it has no effect on the third. Unfortunately, the JVM's error message gives no hint of which case applies; the possibilities can only be ruled out one by one when the error occurs.
To change a thread's stack size, use the -XssN flag (for example, -Xss256k).
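As a rough illustration of how stack size bounds recursion depth, the sketch below (class name and recursion body invented for this example) drives a thread to StackOverflowError and reports how deep it got; rerunning it with different -Xss values (say, -Xss256k versus -Xss2m) changes the reported depth:

```java
// Demonstrates that deep recursion exhausts the thread's native stack.
// Run with e.g. "java -Xss256k StackDepth" and "java -Xss2m StackDepth"
// to see the reachable depth change with the stack size.
public class StackDepth {
    static int depth = 0;

    static void recurse() {
        depth++;
        recurse();
    }

    // Returns the recursion depth reached before the stack overflowed.
    public static int maxDepth() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError expected) {
            // The stack is exhausted; catching here is safe because we
            // immediately unwind the deep call chain.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("Recursion depth before overflow: " + maxDepth());
    }
}
```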

quick summary

  1. On machines where memory is scarce, the thread stack size can be reduced.
  2. On 32-bit JVMs, the thread stack size can be reduced to slightly increase the memory available to the heap within the 4GB process space limit.

Biased locking

When locks are contended, the JVM (and the operating system) can choose how to allocate them. A lock can be granted fairly, with each thread acquiring it in round-robin fashion. Alternatively, the lock can be biased toward the thread that accesses it most frequently.

**The theory behind biased locking is that if a thread has recently used a lock, the data that thread needs the next time it executes code protected by the same lock may still be in the processor's cache. If that thread is given priority in reacquiring the lock, the cache hit rate may increase.** When that works out, performance improves. But because biased locking also requires some bookkeeping, performance can sometimes be worse.

In particular, applications that use a thread pool (which includes most application servers) tend to perform worse with biased locking in effect. In that programming model, different threads have an equal chance of accessing the contended locks. For these kinds of applications, disabling biased locking with the -XX:-UseBiasedLocking option yields a small performance improvement. Biased locking is enabled by default.
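A minimal sketch of that thread-pool pattern, in which pool threads contend equally for a single lock (class name and iteration counts are invented here); a workload of this shape is the kind worth timing with and without -XX:-UseBiasedLocking:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Threads in a pool take turns acquiring the same lock, so no single
// thread dominates it -- the access pattern for which biased locking
// tends to be counterproductive.
public class ContendedCounter {
    private long count = 0;

    public synchronized void increment() { count++; }
    public synchronized long get() { return count; }

    // Runs `threads` pool threads, each incrementing the shared counter.
    public static long run(int threads, int perThread) throws InterruptedException {
        ContendedCounter counter = new ContendedCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) counter.increment();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("final count: " + run(4, 100_000));
    }
}
```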

spin lock

When a synchronized lock is contended, the JVM has two choices for a thread that is blocked trying to acquire it. The thread can enter a busy loop, executing a few instructions and then checking the lock again. Or the thread can be placed in a queue and notified when the lock becomes available (making the CPU available to other threads in the meantime).

A busy loop (so-called thread spinning) is much faster than the alternative if the locks contended by multiple threads are held for a short period of time. If it is held for a long time, it is better to have the second thread wait for the notification, and this way the third thread also has a chance to use the CPU. The JVM will seek a reasonable balance between these two situations, and automatically adjust the spin time before handing over the thread to the queue to be notified. There are some parameters to adjust the spin time, but most are experimental and are subject to change, even with minor version updates.

If you want to influence how the JVM handles spinning, the only reasonable approach is to keep synchronized blocks as short as possible, which should be done in any case. Doing so limits the amount of spinning not directly related to the program's work, and it also reduces the chance that a thread will have to be queued for notification.
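As a sketch of keeping the synchronized block short, the hypothetical class below moves the expensive formatting work outside the lock and synchronizes only the shared-list update (the formatting step stands in for any real per-item work):

```java
import java.util.ArrayList;
import java.util.List;

// Shrinking the critical section: compute outside the lock, then lock
// only for the shared-state update.
public class ShortLock {
    private final List<String> results = new ArrayList<>();

    // Bad: the lock is held for the entire computation.
    public synchronized void addSlow(int value) {
        results.add(format(value));
    }

    // Better: compute first, then lock only for the shared update.
    public void addFast(int value) {
        String formatted = format(value);   // no lock held here
        synchronized (this) {
            results.add(formatted);         // minimal critical section
        }
    }

    private String format(int value) {
        return String.format("result-%08d", value); // stands in for real work
    }

    public synchronized int size() { return results.size(); }

    public static void main(String[] args) {
        ShortLock lock = new ShortLock();
        lock.addFast(1);
        lock.addSlow(2);
        System.out.println("entries: " + lock.size());
    }
}
```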

The UseSpinning flag
Earlier versions of Java supported a `-XX:+UseSpinning` flag that could turn spin locks on or off. In Java 7 and later, the flag has no effect: spinning cannot be disabled. For backward compatibility, however, the Java 7 releases up through 7u40 still accept the flag on the command line, even though it does nothing. Somewhat oddly, the flag's default value is reported as false, even though spinning is always in use.

Starting with Java 7u40 (and in Java 8), the flag is no longer supported, and specifying it produces an error.

thread priority

Each Java thread has a developer-defined priority, which is a clue that the application provides to the operating system as to how important a particular thread is to it. If you have different threads working on different tasks, you might think that you can use thread priority to improve the performance of a particular task at the expense of other tasks running on lower priority threads. Unfortunately, it won't be that useful in practice.

The operating system calculates a "current" priority for each thread running on the machine. The current priority takes into account the priority assigned by Java, but also takes into account many other factors, the most important of which is: the time since the thread last ran. This ensures that all threads have a chance to run at some point. No matter the priority, no thread will be "starved" all the time, waiting to access the CPU.

The balance between these two factors varies from operating system to operating system. On Unix-based systems, the overall priority is calculated mainly from how long it has been since the thread last ran, and the priority assigned at the Java layer has little effect. On Windows, threads with a higher Java-layer priority tend to get more run time than threads with a lower one, but even the lower-priority threads receive a relatively fair share of execution time.

However, in either case, the priority of a thread cannot be relied upon to affect its performance. If some tasks are more important than others, application layer logic must be used to prioritize. To some extent, this can be solved by assigning tasks to different thread pools and modifying the size of those pools.
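One way to sketch that application-layer prioritization, using thread pools of different sizes rather than thread priorities (the pool names and sizes here are invented for illustration and would be tuned to the workload):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Important tasks get a larger dedicated pool, background tasks a smaller
// one, so prioritization never depends on the OS thread scheduler.
public class PoolPriority {
    static final ExecutorService importantPool = Executors.newFixedThreadPool(6);
    static final ExecutorService backgroundPool = Executors.newFixedThreadPool(2);

    // Routes each task to the pool matching its importance.
    public static void submit(Runnable task, boolean important) {
        (important ? importantPool : backgroundPool).execute(task);
    }

    public static void shutdown() throws InterruptedException {
        importantPool.shutdown();
        backgroundPool.shutdown();
        importantPool.awaitTermination(1, TimeUnit.MINUTES);
        backgroundPool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(2);
        submit(done::countDown, true);   // high-priority work
        submit(done::countDown, false);  // background work
        done.await();
        shutdown();
        System.out.println("both tasks completed");
    }
}
```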

summary

Understanding how threads work can yield significant performance benefits. As far as thread performance goes, though, there is not much to tune: only a few JVM flags can be modified, and their effects are limited.

Instead, better thread performance comes from following a set of best-practice principles for managing the number of threads and limiting the impact of synchronization. With the proper profiling and lock analysis tools in place, applications can be inspected and modified to avoid thread and lock issues that negatively impact performance.

Java EE performance tuning

Basic performance of web containers

The key to Java EE application server performance is the web container, which handles HTTP requests through basic servlets and JSP pages. There are some basic approaches to improving the performance of Web containers, and the specific methods of improvement vary by Java EE implementation, but some concepts can be applied to all servers.

Reduced output
Reducing the output produced by the server can speed up the return of Web pages to the browser.

Reduce spaces
Do not write extra whitespace when calling the PrintWriter in servlet code, because whitespace also takes time to transmit over the network (and, compared with the cost of the code that produces it, the network transmission time matters more). print() should be used instead of println(), mainly to avoid writing tabs or newlines into the returned HTML. While this does frustrate some people who look at the source of a Web page, anyone genuinely interested in the source can always load it into an XML or HTML editor. Whitespace can also be handled by an internal QA or performance-optimization team: structured page source certainly simplifies debugging, but to improve the application's response time, the page must ultimately be run through a formatter that strips the excess whitespace. Most application servers can remove whitespace from JSP pages automatically. For example, the trimSpaces directive in Tomcat (and in open source Java EE servers based on Tomcat) removes the leading and trailing whitespace from each line of a JSP page, so JSP pages can be developed and maintained with proper indentation without unnecessary whitespace being transmitted over the network.
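The effect of print() versus println() can be seen without a servlet container; the sketch below writes the same HTML fragment both ways into an in-memory writer standing in for the response, and compares the sizes:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// println() appends a line separator that is pure whitespace on the wire;
// over thousands of generated HTML lines the difference adds up, and
// browsers render both outputs identically.
public class WhitespaceDemo {
    public static int bytesWithPrintln(int lines) {
        StringWriter sw = new StringWriter();
        PrintWriter out = new PrintWriter(sw);
        for (int i = 0; i < lines; i++) out.println("<td>42</td>");
        out.flush();
        return sw.toString().length();
    }

    public static int bytesWithPrint(int lines) {
        StringWriter sw = new StringWriter();
        PrintWriter out = new PrintWriter(sw);
        for (int i = 0; i < lines; i++) out.print("<td>42</td>");
        out.flush();
        return sw.toString().length();
    }

    public static void main(String[] args) {
        System.out.println("with println: " + bytesWithPrintln(1000) + " chars");
        System.out.println("with print:   " + bytesWithPrint(1000) + " chars");
    }
}
```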

Merging CSS and JavaScript resources
For developers, keeping CSS in a separate file makes sense and is easier to maintain; the same is true for JavaScript. But when these resources are served, transferring one large file is more efficient than transferring several small ones. Java EE has no standard for this, and most application servers cannot do it automatically, but development tools exist that can help combine these resources.

Compressing output
From the user's perspective, the longest part of a web request is usually the time it takes the server to send the HTML back to the browser. But because client-to-server performance tests (with simulated browsers) are usually run on a fast local network, this time is often not the longest part of the test. Real users may be on a "fast" WAN, but that is still an order of magnitude slower than a LAN between machines in a lab. Most application servers have a mechanism for compressing the data sent back to the browser: the HTML is sent with a content encoding of gzip. This works only if the original request indicates that the browser supports compression; all modern browsers do. Enabling compression costs more CPU cycles on the server, but the smaller amount of data usually takes so much less time to transfer over the network that overall performance is higher. Unlike the other optimizations discussed in this section, however, compression does not always help: the example below shows that performance may suffer when compression is enabled on a LAN, and the same is true when the application sends very small pages (though most application servers apply compression only when the output exceeds a certain size).
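The server-side compression step can be approximated with the JDK's own GZIPOutputStream; the sketch below compresses a repetitive HTML page of the kind described above (the page content is invented, and the compression ratio will vary with the content):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Roughly what the server does when the request carries
// "Accept-Encoding: gzip": the HTML body is gzip-compressed before it
// is written to the socket. HTML is repetitive, so it compresses well.
public class GzipDemo {
    public static byte[] gzip(String html) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder page = new StringBuilder("<html><body><table>");
        for (int i = 0; i < 1000; i++) {
            page.append("<tr><td>row ").append(i).append("</td></tr>");
        }
        page.append("</table></body></html>");
        String html = page.toString();
        System.out.println("raw bytes:  " + html.length());
        System.out.println("gzip bytes: " + gzip(html).length);
    }
}
```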

Should strings be pre-encoded?
Application servers spend a lot of time on character conversion: translating Java String objects (stored in UTF-16) into the byte arrays the client needs. Many of those strings are always the same.
The HTML strings of a Web page do not always change with the data (and when they do, they are still drawn from a set of string constants).

Whether strings are pre-encoded depends on the server: some servers provide an option for it, and some do it automatically.

In a servlet, these strings can be pre-encoded and then sent over the network with the ServletOutputStream's write() method rather than the PrintWriter's print(). Dynamic data, however, still has to go through print() to be encoded correctly.
(You could look up the target encoding in the request headers and encode the strings yourself, but that approach is relatively error-prone.)

Application servers differ widely in how they implement these output interfaces and buffer the data internally. On some servers, mixing the servlet's output stream with its companion print writer causes frequent flushes of the network buffer.

From a performance standpoint, frequent buffer flushes are very expensive operations, more expensive than re-encoding the data. Similarly, encoding one large chunk of data usually costs little more than encoding a small one: the dominant cost is setting up the call into the encoder. So repeatedly encoding small pieces of dynamic data and sending pre-encoded byte arrays in between can slow an application down: the many calls into the encoder take longer than one call that encodes everything, static data included.
Pre-encoding output can help in some cases, but whether it does depends on the circumstances.
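A sketch of the pre-encoding idea, with a ByteArrayOutputStream standing in for the ServletOutputStream (the HTML fragments are invented for this example):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// The constant HTML fragments are converted to bytes a single time, at
// class-initialization; only the dynamic data is encoded per request.
public class PreEncoded {
    private static final byte[] HEADER =
            "<html><body><h1>Stock history</h1>".getBytes(StandardCharsets.UTF_8);
    private static final byte[] FOOTER =
            "</body></html>".getBytes(StandardCharsets.UTF_8);

    public static byte[] renderPage(String dynamicData) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER);                                        // no re-encoding
        out.write(dynamicData.getBytes(StandardCharsets.UTF_8));  // encoded per request
        out.write(FOOTER);                                        // no re-encoding
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = renderPage("<p>dynamic content</p>");
        System.out.println(new String(page, StandardCharsets.UTF_8));
    }
}
```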

These optimizations can behave very differently in real operation than in a test. The table below shows possible results. The stock history servlet used in the test produces relatively long output, covering a 10-year range of data; the resulting uncompressed, untrimmed HTML page is about 100 KB. To minimize the impact of bandwidth, the test ran with only a single user and a 100 ms think time, measuring the average response time of the requests. The LAN tests ran on a local network through a 100 Mb/s switch; the broadband tests ran over a home cable connection (averaging 30 Mb/s download); and the WAN tests ran over a public WiFi connection at a local coffee shop, whose speed was quite unreliable (the table shows the average over a 4-hour sample).
The effect of several web response output size optimizations under different network conditions:

This table underscores the importance of testing in the environment where the application will actually be deployed: if you tune only in a lab environment, more than half of your conclusions may not hold up in production. Although the tests in this example actually ran against a remote application server (on a public cloud service), a hardware emulator can simulate the lab environment and control all the machines involved. (The cloud machines are also faster than the LAN machines, so the numbers between the two setups are not directly comparable.)

quick summary

  1. Test your Java EE applications on the network infrastructure they actually run on.
  2. The external network is still slower than the internal network. Limiting the amount of data written by the application can achieve good performance.


Origin blog.csdn.net/weixin_42583701/article/details/131427067