Using Transparent Huge Pages in the JVM

"[JVM Anatomy Park][1]" is a continuously updated series of mini-posts, each taking 5 to 10 minutes to read. Due to space limits, each post explores only one topic, following the format of a question, a test, a benchmark, and the observed results. The data and discussion here should therefore be treated as anecdotes: they are not checked for errors in style, syntax, semantics, duplication, or consistency. If you choose to act on anything written here, you do so at your own risk.


Aleksey Shipilёv, JVM performance geek   


Twitter: [@shipilev][2]

  

Questions, comments, and suggestions can be sent to [[email protected]][3]


[1]:https://shipilev.net/jvm-anatomy-park

[2]:http://twitter.com/shipilev

[3]: [email protected]


2. Problem


What is a huge page? What is THP (Transparent Huge Page)? How can understanding it help us?


3. Theory


The concept of "virtual memory" is now widely accepted. Only a few people still remember "real mode" programming, let alone practice it; in that mode, a program works directly with physical memory. With virtual memory, in contrast, each process has its own virtual address space, which is mapped onto real memory. For example, two processes can store different data at the same virtual address `0x42424242`, while that data actually resides in different physical memory. When a program accesses the address, some mechanism translates the virtual address into an actual physical one.


This translation is usually backed by a "[page table][4]" maintained by the operating system, and the hardware performs the translation by "walking the page table". Although translating addresses at page granularity is easier, it incurs significant overhead because a translation happens on every memory access. To mitigate this, the [TLB (translation lookaside buffer)][5] caches the most recent translations. The TLB must be at least as fast as the L1 cache, so it typically holds fewer than 100 entries. For heavy workloads, TLB misses and the resulting page table walks cost a lot of time.


[4]:https://en.wikipedia.org/wiki/Page_table

[5]:https://en.wikipedia.org/wiki/Translation_lookaside_buffer
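The split of a virtual address into a page number and an in-page offset can be sketched as follows. This is a toy model assuming 4K pages and a single lookup step; real x86_64 page tables are multi-level, but the page-number/offset split is the same idea:

```java
public class AddressSplit {
    static final int PAGE_SHIFT = 12;                 // 4K pages: 2^12 bytes

    // The page number is what the page table (and the TLB) maps to a physical frame;
    // the offset within the page passes through to the physical address unchanged.
    static long pageNumber(long virtualAddress) {
        return virtualAddress >>> PAGE_SHIFT;
    }

    static long pageOffset(long virtualAddress) {
        return virtualAddress & ((1L << PAGE_SHIFT) - 1);
    }

    public static void main(String[] args) {
        long addr = 0x42424242L;
        System.out.println("page #" + pageNumber(addr) + ", offset " + pageOffset(addr));
    }
}
```

Two processes can both use `0x42424242` because the same page number simply maps to different physical frames in their respective page tables.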


Although we cannot make the TLB larger, we can do something else: make the memory pages larger! Most hardware offers 4K basic pages plus 2M/4M/1G "large pages". Using larger pages to cover the same area also shrinks the page table, which reduces the time spent walking it.
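To see why larger pages help, consider the "reach" of a TLB: the amount of memory a fully populated TLB covers, which is simply entries × page size. A back-of-the-envelope calculation (the 64-entry dTLB size is an assumed typical figure, not a measurement):

```java
public class TlbReach {
    // Memory covered by a fully populated TLB: entries * pageSize bytes.
    static long reachBytes(int entries, long pageSize) {
        return entries * pageSize;
    }

    public static void main(String[] args) {
        int entries = 64;                         // assumed typical dTLB size
        System.out.println("4K pages: " + reachBytes(entries, 4L << 10) / 1024 + " KB");
        System.out.println("2M pages: " + reachBytes(entries, 2L << 20) / (1 << 20) + " MB");
    }
}
```

With 4K pages the same TLB covers only 256 KB; with 2M pages it covers 128 MB, so a workload touching a 100M array can stop missing the TLB entirely.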


In the Linux world, there are at least two distinct ways for an application to get large pages:


- [hugetlbfs][6]: carve out a piece of system memory, expose it as a virtual filesystem, and let applications access it via mmap(2). This is a specialized interface that requires both the OS and the application to be configured for it. It is also "all or nothing": the (persistent) space reserved for hugetlbfs cannot be used by regular processes.


- [THP (Transparent Huge Pages)][7]: let the application allocate memory as usual, while the OS tries to transparently back it with huge page storage. Ideally no application changes are needed to enable THP, yet the application still benefits from it. In practice, THP can cost memory (a whole huge page may be allocated for something small) or time (THP sometimes has to defragment memory to allocate a page). The good news is there is a middle ground: the application can call madvise(2) to tell Linux where it makes sense to enable THP.


[6]:https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

[7]:https://www.kernel.org/doc/Documentation/vm/transhuge.txt


I do not understand why the terms "large" and "huge" are used interchangeably. Anyway, OpenJDK supports both modes:


```
$ java -XX:+PrintFlagsFinal 2>&1 | grep Huge
 bool UseHugeTLBFS             = false      {product} {default}
 bool UseTransparentHugePages  = false      {product} {default}
$ java -XX:+PrintFlagsFinal 2>&1 | grep LargePage
 bool UseLargePages            = false   {pd product} {default}
```


`-XX:+UseHugeTLBFS` mmaps the Java heap into a separately allocated hugetlbfs.


`-XX:+UseTransparentHugePages` uses madvise(2) to hint that the Java heap should use THP. This is a convenient option, because we know the Java heap is large and mostly contiguous, and is very likely to benefit from large pages.


`-XX:+UseLargePages` is a shortcut that enables whatever is available. On Linux, this option enables hugetlbfs rather than THP. I suppose this is for historical reasons, since hugetlbfs came first.


Some applications actually [suffer][8] with large pages enabled (sometimes you see people doing manual memory management to avoid GC, only to trigger THP defragmentation and hit latency spikes). My intuition is that THP works poorly for short-lived applications, where the time spent defragmenting is significant compared to the application's lifetime.


[8]:https://bugs.openjdk.java.net/browse/JDK-8024838


4. Experiment


Can we show an example of the benefit large pages give us? Of course we can. Any systems performance engineer has run a workload like this at least once by their thirties: allocate a `byte[]` array and access it at random indices:


```java
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ByteArrayTouch {

    @Param(...)
    int size;

    byte[] mem;

    @Setup
    public void setup() {
        mem = new byte[size];
    }

    @Benchmark
    public byte test() {
        // Read one byte at a random index; for large arrays,
        // cache and TLB misses dominate the cost of this load.
        return mem[ThreadLocalRandom.current().nextInt(size)];
    }
}
```


(The full source code is available [here][9].)


[9]:https://shipilev.net/jvm/anatomy-quarks/2-transparent-huge-pages/ByteArrayTouch.java


We know that, depending on the array size, performance may end up dominated by L1, L2, or L3 cache misses. This analysis usually ignores the cost of TLB misses.


Before running the test, we must decide on the heap size. On my machine the L3 cache is about 8M, so a 100M array is more than enough to exceed it. That means allocating a 1G heap with `-Xmx1G -Xms1G` satisfies the test conditions. This also tells us how much space to reserve for hugetlbfs.
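As a sanity check on the sizing, we can compute how many 2M huge pages a 1G heap needs. This is a hypothetical helper, not part of the benchmark:

```java
public class HugePagesForHeap {
    // Number of huge pages needed to back a region of the given size, rounded up.
    static long pagesNeeded(long regionBytes, long pageBytes) {
        return (regionBytes + pageBytes - 1) / pageBytes;
    }

    public static void main(String[] args) {
        long heap = 1L << 30;       // -Xms1G -Xmx1G
        long page = 2L << 20;       // 2M huge pages
        System.out.println("vm.nr_hugepages must be at least " + pagesNeeded(heap, page));
    }
}
```

A 1G heap needs 512 huge pages, so reserving 1000 pages (as below) leaves comfortable slack for other JVM mappings.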


Next, make sure the following options are set:


```
# HugeTLBFS should allocate 1000 * 2M pages:
sudo sysctl -w vm.nr_hugepages=1000

# THP should only act on "madvise" hints (some distros provide options to set the default):
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```


I prefer "madvise" for THP, because it lets me opt in only the specific parts of memory that I already know will benefit.
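To double-check which mode THP is actually in, we can read the same /sys files back: the kernel marks the active mode with brackets, e.g. `always [madvise] never`. A small hypothetical helper (JDK 8 compatible):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ThpMode {
    // Extract the bracketed (active) mode, e.g. "always [madvise] never" -> "madvise".
    static String activeMode(String line) {
        int a = line.indexOf('['), b = line.indexOf(']');
        return (a >= 0 && b > a) ? line.substring(a + 1, b) : line.trim();
    }

    public static void main(String[] args) throws Exception {
        for (String f : new String[] {"enabled", "defrag"}) {
            Path p = Paths.get("/sys/kernel/mm/transparent_hugepage", f);
            if (Files.exists(p)) {
                String raw = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                System.out.println(f + " = " + activeMode(raw));
            }
        }
    }
}
```

After the `tee` commands above, both files should report `madvise` as the active mode.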


Running on an i7 4790K, Linux x86_64, JDK 8u101:


```
Benchmark               (size)  Mode  Cnt   Score   Error  Units

# Baseline
ByteArrayTouch.test       1000  avgt   15   8.109 ± 0.018  ns/op
ByteArrayTouch.test      10000  avgt   15   8.086 ± 0.045  ns/op
ByteArrayTouch.test    1000000  avgt   15   9.831 ± 0.139  ns/op
ByteArrayTouch.test   10000000  avgt   15  19.734 ± 0.379  ns/op
ByteArrayTouch.test  100000000  avgt   15  32.538 ± 0.662  ns/op

# -XX:+UseTransparentHugePages
ByteArrayTouch.test       1000  avgt   15   8.104 ± 0.012  ns/op
ByteArrayTouch.test      10000  avgt   15   8.060 ± 0.005  ns/op
ByteArrayTouch.test    1000000  avgt   15   9.193 ± 0.086  ns/op // !
ByteArrayTouch.test   10000000  avgt   15  17.282 ± 0.405  ns/op // !!
ByteArrayTouch.test  100000000  avgt   15  28.698 ± 0.120  ns/op // !!!

# -XX:+UseHugeTLBFS
ByteArrayTouch.test       1000  avgt   15   8.104 ± 0.015  ns/op
ByteArrayTouch.test      10000  avgt   15   8.062 ± 0.011  ns/op
ByteArrayTouch.test    1000000  avgt   15   9.303 ± 0.133  ns/op // !
ByteArrayTouch.test   10000000  avgt   15  17.357 ± 0.217  ns/op // !!
ByteArrayTouch.test  100000000  avgt   15  28.697 ± 0.291  ns/op // !!!
```


A few observations:


  1. For smaller arrays, the caches and the TLB behave well, and there is no significant difference from the baseline.

  2. For larger arrays, cache misses begin to dominate, which is why the cost grows in every configuration.

  3. For the largest arrays, TLB misses show up, and enabling larger pages helps a lot!

  4. Both `UseTHP` and `UseHTLBFS` help equally, because they provide the same service to the application.


To verify the TLB miss hypothesis, we can look at the hardware counters. Running JMH with `-prof perfnorm` reports them normalized per operation.


```
Benchmark                                (size)  Mode  Cnt    Score    Error  Units

# Baseline
ByteArrayTouch.test                   100000000  avgt   15   33.575 ±  2.161  ns/op
ByteArrayTouch.test:cycles            100000000  avgt    3  123.207 ± 73.725   #/op
ByteArrayTouch.test:dTLB-load-misses  100000000  avgt    3    1.017 ±  0.244   #/op  // !!!
ByteArrayTouch.test:dTLB-loads        100000000  avgt    3   17.388 ±  1.195   #/op

# -XX:+UseTransparentHugePages
ByteArrayTouch.test                   100000000  avgt   15   28.730 ±  0.124  ns/op
ByteArrayTouch.test:cycles            100000000  avgt    3  105.249 ±  6.232   #/op
ByteArrayTouch.test:dTLB-load-misses  100000000  avgt    3   ≈ 10⁻³            #/op
ByteArrayTouch.test:dTLB-loads        100000000  avgt    3   17.488 ±  1.278   #/op
```


There it is! In the baseline, each operation incurs about one dTLB load miss; with THP enabled, there are far fewer.


Of course, with THP defrag enabled, defragmentation costs are paid at allocation or access time. To shift these costs to JVM startup and avoid surprising latency hiccups while the application is running, we can ask the JVM to touch every page in the Java heap during initialization with `-XX:+AlwaysPreTouch`. Enabling pre-touch is a good idea for larger heaps anyway.


Fun fact: enabling `-XX:+UseTransparentHugePages` actually makes `-XX:+AlwaysPreTouch` faster, because the JVM knows it now has to touch the heap with a larger stride (one byte per 2M) instead of a smaller one (one byte per 4K). With THP, freeing memory on process death is also faster; this somewhat crude effect will last until the concurrent memory release patch makes it into distro kernels.
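What pre-touch does can be sketched with a plain Java loop over an ordinary array standing in for the heap. This is a simplified model of my own; the real `-XX:+AlwaysPreTouch` works on the native heap mapping inside the JVM:

```java
public class PreTouchSketch {
    // Touch one byte per page so the OS backs the whole region with real memory
    // up front; returns the number of touches performed.
    static int preTouch(byte[] region, int stride) {
        int touches = 0;
        for (int i = 0; i < region.length; i += stride) {
            region[i] = 0;
            touches++;
        }
        return touches;
    }

    public static void main(String[] args) {
        byte[] heap = new byte[64 * 1024 * 1024];        // stand-in for the Java heap
        System.out.println("4K stride: " + preTouch(heap, 4 * 1024) + " touches");
        System.out.println("2M stride: " + preTouch(heap, 2 * 1024 * 1024) + " touches");
    }
}
```

For a 64M region, the 4K stride makes 16384 touches versus only 32 at the 2M stride, which is why pre-touching a huge-page-backed heap finishes so much sooner.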


For example, with a 4 TB (terabyte!) heap:


```
$ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
real    13m58.167s
user    43m37.519s
sys     1011m25.740s

$ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
real    2m14.758s
user    1m56.488s
sys     73m59.046s
```


Committing and releasing 4 TB of memory certainly takes a noticeable amount of time.


5. Observations


Using large pages is an easy trick for improving application performance. THP in the kernel makes them easier to get; support for THP in the JVM makes them easy to opt into. It is always worth trying large pages when your application has lots of data and a large heap.



Origin blog.51cto.com/15082395/2590381