Notes on using off-heap memory to optimize JVM GC

My most recent project is a critical service that, because of the peculiarities of its business, ran into a series of GC problems. After a stretch of repeated attempts and investigation, we finally solved them completely. Here is a record of the process and the takeaways.

Brief Background

The service provides product ranking. The business requirements are as follows:

  1. Products are grouped by country, and each country has a different set of products.
  2. Each product has a primary key field goodsId, plus a feature vector, stored as a one-dimensional float array of length 128.
  3. When ranking, the caller supplies a query feature vector A and a batch of candidate goodsIds (up to 5,000). We multiply A with the feature vector of every candidate product to get each product's matching score, and return the scores.
  4. The product set needs to be refreshed regularly, and each country's product set is refreshed separately.

You can already see what is special about this service: a single request may need to look up as many as 5,000 float[128] arrays! How to store this data is a real question.

Our approach was to build a big map in memory, with the structure Map<String, Map<String, float[128]>>. The outer map goes from country to that country's product set; the inner map goes from goodsId to its feature vector. We worked out the data volume: a rough estimate is that a single inner map takes about 350 MB of memory, and the whole outer map takes about 2 GB.

To make this concrete, the map looks roughly like this:

nation1:
  goodsId1: feature vector 1
  goodsId2: feature vector 2
  ...
nation2:
  goodsId1: feature vector 1
  goodsId2: feature vector 2
  ...

At this point you will surely ask: why not use a centralized cache such as Redis instead of putting the data directly into local memory?

Well, before writing this article I ran a stress test and found that Redis's performance is not as strong as people imagine. For example, the official claim of 100,000+ OPS for a single instance is indeed achievable, but what does that number really mean? It means a single GET takes 0.01 ms, so an MGET of 1,000 keys takes 10 ms! And that is without counting network latency. In my local test (server and client both local, on a physical machine / VM), an MGET of 5,000 keys took between 40 and 60 ms (the values in this scenario are not large, around 1 KB, and did not cause any noticeable performance drop). See also this article: Redis performance fantasy and harsh reality.

Another idea was Redis plus a local cache. But this scenario has millions of entries and no hot keys, so a local cache would be of little use.

Back to the topic. With this map in place, the main interface of the service is easy to implement:

  • Input: a nation parameter, a set of goodsIds, and a query feature vector A, which is also a float[128].
  • Look up each product's feature vector by nation and goodsId, then multiply it with A to get the product's matching score (a minimal sketch of this scoring step follows below).
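
For illustration only, here is a minimal sketch of what the scoring step looks like; the field and method names and the use of a plain dot product are my assumptions, not the actual service code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Scorer {

    // Hypothetical in-heap store: nation -> (goodsId -> float[128] feature vector), as described above.
    private final Map<String, Map<String, float[]>> featuresByNation;

    public Scorer(Map<String, Map<String, float[]>> featuresByNation) {
        this.featuresByNation = featuresByNation;
    }

    // Scores every candidate goodsId against the query vector a (dot product).
    public List<Float> score(String nation, List<String> goodsIds, float[] a) {
        Map<String, float[]> byGoodsId = featuresByNation.get(nation);
        List<Float> scores = new ArrayList<>(goodsIds.size());
        for (String goodsId : goodsIds) {
            float[] v = byGoodsId.get(goodsId);
            float s = 0f;
            for (int i = 0; i < a.length; i++) {
                s += a[i] * v[i];
            }
            scores.add(s);
        }
        return scores;
    }
}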

The first version's performance: under normal QPS, average latency was below 10 ms.

That completes the background. Now the nightmare begins ~

GC problems arise

Everything looked perfect after going live. However, after running for a while, the upstream service started to see occasional bursts of timeouts, each lasting a very short time. On investigation we noticed that the service's TP99 metric spiked whenever the problem occurred, as shown below:

Sometimes the response time soared to nearly one second! With nothing abnormal in the application logs, the only suspect left was GC, so I pulled the GC logs to find out.

Observing the GC logs

The following GC log snippet was captured with JVM parameters -Xms4g -Xmx4g; Java version: OpenJDK 1.8.0_212.

{Heap before GC invocations=393 (full 5):
 PSYoungGen      total 1191936K, used 191168K [0x000000076ab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 986112K, 0% used [0x000000076ab00000,0x000000076ab00000,0x00000007a6e00000)
  from space 205824K, 92% used [0x00000007b3700000,0x00000007bf1b0000,0x00000007c0000000)
  to   space 205824K, 0% used [0x00000007a6e00000,0x00000007a6e00000,0x00000007b3700000)
 ParOldGen       total 2796544K, used 2791929K [0x00000006c0000000, 0x000000076ab00000, 0x000000076ab00000)
  object space 2796544K, 99% used [0x00000006c0000000,0x000000076a67e750,0x000000076ab00000)
 Metaspace       used 70873K, capacity 73514K, committed 73600K, reserved 1114112K
  class space    used 8549K, capacity 9083K, committed 9088K, reserved 1048576K
4542.168: [Full GC (Ergonomics) [PSYoungGen: 191168K->167781K(1191936K)] [ParOldGen: 2791929K->2796093K(2796544K)] 2983097K->2963875K(3988480K), [Metaspace: 70873K->70638K(1114112K)], 2.9853595 secs] [Times: user=11.28 sys=0.00, real=2.99 secs]
Heap after GC invocations=393 (full 5):
 PSYoungGen      total 1191936K, used 167781K [0x000000076ab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 986112K, 0% used [0x000000076ab00000,0x000000076ab00000,0x00000007a6e00000)
  from space 205824K, 81% used [0x00000007b3700000,0x00000007bdad95e8,0x00000007c0000000)
  to   space 205824K, 0% used [0x00000007a6e00000,0x00000007a6e00000,0x00000007b3700000)
 ParOldGen       total 2796544K, used 2796093K [0x00000006c0000000, 0x000000076ab00000, 0x000000076ab00000)
  object space 2796544K, 99% used [0x00000006c0000000,0x000000076aa8f6d8,0x000000076ab00000)
 Metaspace       used 70638K, capacity 73140K, committed 73600K, reserved 1114112K
  class space    used 8514K, capacity 9016K, committed 9088K, reserved 1048576K
}

Some information can be read off the log:

  • On JDK 8, when no collector is specified, the default is the Parallel Scavenge + Parallel Old combination, not CMS as one might assume.
  • The cause of this Full GC is that the old generation is full, and the stop-the-world pause lasted about three seconds......

Adjusting the GC strategy

Since a GC problem had shown up, some tuning was in order. Here is what I tried:

  1. Switch the garbage collector to CMS.
  2. Since the old generation keeps running out of space, just give it more! Enlarge the whole heap and give the old generation a bigger share (example flags are sketched after this list).
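
For reference, the flags for these two experiments look roughly like this; the collector switches are standard JDK 8 options, but the sizes shown are illustrative, not our exact production values:

# Experiment 1: switch to CMS (ParNew for the young generation)
-Xms6g -Xmx6g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintHeapAtGC

# Experiment 2: stay on the default Parallel collectors, but enlarge the heap and old generation
-Xms6g -Xmx6g -XX:NewRatio=3 -XX:+PrintGCDetails -XX:+PrintHeapAtGC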

The following are the results and conclusions:

  1. Switching to CMS was useless, in fact it was worse. My guess is that CMS depends heavily on the number of CPU cores, and our docker environment caps the core count very low, so the benefit of CMS's concurrent phases was not noticeable. Worse, whenever the old generation got tight, a Concurrent Mode Failure occurred and the collector fell back to a serial Full GC, which took even longer.
  2. After enlarging the heap, ordinary GCs and Full GCs became less frequent, but each individual GC became slower. The problem was not solved.

Here is the crash scene under CMS:

[GC (CMS Initial Mark) [1 CMS-initial-mark: 4793583K(5472256K)] 4886953K(6209536K), 0.0075637 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[CMS-concurrent-mark-start]
03:05:50.594 INFO [XNIO-2 task-8] c.shein.srchvecsort.filter.LogFilter ---- GET /prometheus?null took 3ms and returned 200
{Heap before GC invocations=240 (full 7):
par new generation total 737280K, used 737280K [0x0000000640000000, 0x0000000672000000, 0x0000000672000000)
eden space 655360K, 100% used [0x0000000640000000, 0x0000000668000000, 0x0000000668000000)
from space 81920K, 100% used [0x0000000668000000, 0x000000066d000000, 0x000000066d000000)
to space 81920K, 0% used [0x000000066d000000, 0x000000066d000000, 0x0000000672000000)
concurrent mark-sweep generation total 5472256K, used 4793583K [0x0000000672000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 66901K, capacity 69393K, committed 69556K, reserved 1110016K
class space used 8346K, capacity 8805K, committed 8884K, reserved 1048576K
[GC (Allocation Failure) [ParNew: 737280K->737280K(737280K), 0.0000229 secs][CMS[CMS-concurrent-mark: 1.044/1.045 secs] [Times: user=1.36 sys=0.05, real=1.05 secs]
(concurrent mode failure): 4793583K->3662044K(5472256K), 3.8206326 secs] 5530863K->3662044K(6209536K), [Metaspace: 66901K->66901K(1110016K)], 3.8207144 secs] [Times: user=3.82 sys=0.00, real=3.82 secs]
Heap after GC invocations=241 (full 8):
par new generation total 737280K, used 0K [0x0000000640000000, 0x0000000672000000, 0x0000000672000000)
eden space 655360K, 0% used [0x0000000640000000, 0x0000000640000000, 0x0000000668000000)
from space 81920K, 0% used [0x0000000668000000, 0x0000000668000000, 0x000000066d000000)
to space 81920K, 0% used [0x000000066d000000, 0x000000066d000000, 0x0000000672000000)
concurrent mark-sweep generation total 5472256K, used 3662044K [0x0000000672000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 66901K, capacity 69393K, committed 69556K, reserved 1110016K
class space used 8346K, capacity 8805K, committed 8884K, reserved 1048576K
}

By the way, here are some pits I ran into along the way, mostly limitations of the docker / k8s environment; some are still unresolved. I may write about them in detail some other time.

  • jstat: because the Java process shows up as PID 0 in the container, the target process cannot be specified.
  • The crash dump on OOM does not seem easy to collect.
  • VisualVM is hard to connect.

How to avoid the GC problem

Question: Where is the problem?

The GC problem is actually quite obvious: because of the nature of the business, we hold several huge map objects in memory, and they inevitably end up in the old generation. But these objects are not immortal! Every so often, because the data needs to be refreshed, new map objects are created, the old map objects lose their references, and the GC has to reclaim them. With old-generation usage growing so much, a Major GC is eventually unavoidable, and a collection that frees that much memory in one go is hard to keep short.

Since the problem lies with the large map objects, the solution follows naturally: avoid large map objects, or more precisely, do not put that much data into heap memory.

If the data does not live in the heap, it either goes off-heap (direct memory, still inside the process) or outside the process entirely.

The out-of-process options are obviously the various databases. In-process options come in two flavors: in-process (embedded) databases such as Berkeley DB, and off-heap caches. A database is too heavyweight for me; what I really want is just map-like functionality. So we decided to look further into off-heap caches.

Aside: I will not go over Java caching options in detail here; quoting a classification of caches from Kaitao's architecture book:

Types of Java caches

  • On-heap cache: stores objects in Java heap memory. Pros: no serialization/deserialization needed, fast. Cons: affected by GC. Can be implemented with Guava Cache, Ehcache 3.x, or MapDB.
  • Off-heap cache: data is stored outside the heap, escaping the constraints of the JVM heap, but reads require serialization/deserialization, so it is much slower than an on-heap cache. Can be implemented with Ehcache 3.x or MapDB.
  • Disk cache: data survives a JVM restart, whereas on-heap / off-heap cached data is lost and must be reloaded. Can be implemented with Ehcache 3.x or MapDB.
  • Distributed cache: nothing much to say here, Redis...

Off-heap cache

Let me just link two articles here:

A brief summary:

  1. Keeping a huge amount of data in an on-heap cache causes GC performance problems; an off-heap cache can solve this.
  2. The principle of an off-heap cache is to manipulate process memory directly, via something like Unsafe. You have to manage memory reclamation yourself, as well as serialization/deserialization between Java objects and bytes, because off-heap memory only knows about bytes, not Java objects.
  3. So for something usable and convenient, it is best to rely on a framework. Frameworks with off-heap support include MapDB, ohc, Ehcache 3 and so on. Ehcache 3 is a paid option, and ohc is the fastest.

Putting it all together, we decided to adopt ohc. The official site is here.
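
To give a sense of what using ohc looks like, here is a minimal sketch of building one off-heap cache for a country's goodsId -> float[128] data. The builder calls follow ohc's public API as I understand it, but the serializer classes and the capacity value are hypothetical placeholders (the value serializer is sketched in the serialization section further down):

import org.caffinitas.ohc.OHCache;
import org.caffinitas.ohc.OHCacheBuilder;

public class OhcExample {

    // One off-heap cache per country: goodsId -> 128-dim feature vector.
    static OHCache<String, float[]> buildCountryCache() {
        return OHCacheBuilder.<String, float[]>newBuilder()
                .keySerializer(new StringSerializer())        // hypothetical CacheSerializer<String>
                .valueSerializer(new FloatArraySerializer())  // hypothetical CacheSerializer<float[]>, sketched below
                .capacity(512L * 1024 * 1024)                 // illustrative: 512 MB of off-heap memory
                .build();
    }
}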

Code design

Ideas:

  1. For flexibility, use the strategy pattern (see the end of the off-heap cache article referenced above).
  2. Since it is still used as a map, the wrapper class implements the Map interface. In ohc, an OHCache object represents one off-heap cache; I wrap each one as a map that stores one country's data, so the program naturally holds multiple OHCache instances.
  3. Also note that OHCache itself implements the Closeable interface: calling its close() method releases its resources, i.e. frees the off-heap memory. The wrapper class therefore implements Closeable as well, and when a country's data is refreshed we call close() on the old map object being replaced to free its memory. Tested and it works (a sketch of such a wrapper follows this list).
  4. Since this is running in Spring Boot and it is, after all, a caching framework, I would like to adapt it into the Spring cache abstraction. That has not been implemented yet.
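
Here is a minimal sketch of what such a wrapper might look like; the class name, extending AbstractMap, and the subset of methods shown are my assumptions, not the original implementation:

import java.io.Closeable;
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;

import org.caffinitas.ohc.OHCache;

// Hypothetical wrapper: exposes one country's off-heap cache as a Map,
// and frees the off-heap memory via close() when the map is replaced on refresh.
public class OffHeapFeatureMap extends AbstractMap<String, float[]> implements Closeable {

    private final OHCache<String, float[]> cache;

    public OffHeapFeatureMap(OHCache<String, float[]> cache) {
        this.cache = cache;
    }

    @Override
    public float[] get(Object goodsId) {
        return cache.get((String) goodsId);
    }

    @Override
    public float[] put(String goodsId, float[] features) {
        cache.put(goodsId, features);
        return null; // previous value is not tracked in this sketch
    }

    @Override
    public Set<Map.Entry<String, float[]>> entrySet() {
        // Iteration is not needed for the lookup-only use case, so it is left out of the sketch.
        throw new UnsupportedOperationException();
    }

    @Override
    public void close() throws IOException {
        cache.close(); // releases the off-heap memory held by this country's data
    }
}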

Effect

The following figure shows the situation before the improvement, with the in-heap map: the GC activity triggered whenever the map data is refreshed:

We can see long Young GCs as well as Major GCs. The actual GC times are much higher than what the chart (an Actuator metric) shows; the Major GCs take more than 1 second.

After switching to off-heap memory, I cut the JVM heap down, leaving enough memory for the off-heap area. The effect:

  • The effect was exactly the same... Major GCs still happened and the problem was still there! I won't even post the chart.
  • Average latency went from the original 10 ms up to 40 ms......

This made no sense! Something had to be wrong somewhere.

Going further

I won't keep you in suspense here; let me just give the direct causes of the problem.

Pit I: reading large files

This upgrade also included another change: the data source for map updates switched from the database to files on S3, and these files can be a few hundred MB each. We read them with Commons IO's readLines() method. Well, that loads the entire file content into the heap at once; no wonder the GC went crazy!

After switching to a line iterator, the GC problem was finally gone. There are no more Major GCs at all.
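
For reference, a minimal sketch of the streaming read with Commons IO's LineIterator; the file handling and parsing are placeholders, not the original loader:

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class FeatureFileLoader {

    // Reads the feature file line by line instead of pulling the whole file
    // into the heap at once, as FileUtils.readLines() would.
    static void load(File file) throws IOException {
        LineIterator it = FileUtils.lineIterator(file, "UTF-8");
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                // parse goodsId and the float[128] vector from the line,
                // then put the entry into the off-heap cache
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    }
}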

Here is a chart taken while the map is being replaced:

The Major GC count is zero!

Pit II: Serialization

ohc requires you to provide serializers for the key and the value, which write into a ByteBuffer. Out of ignorance, I once again reached for Apache's serialization utility: the on-heap object was turned into a byte array via JDK serialization and then copied into the ByteBuffer.

The fix was a custom serializer that operates on the ByteBuffer directly. After this change, the latency problem was solved.
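
A minimal sketch of such a direct ByteBuffer serializer for the float[128] values, assuming ohc's CacheSerializer interface (serialize / deserialize / serializedSize); illustrative only, not the original code:

import java.nio.ByteBuffer;

import org.caffinitas.ohc.CacheSerializer;

// Writes the float[] straight into ohc's ByteBuffer, avoiding JDK serialization
// and the intermediate byte[] copy.
public class FloatArraySerializer implements CacheSerializer<float[]> {

    @Override
    public void serialize(float[] value, ByteBuffer buf) {
        buf.putInt(value.length);          // length prefix (always 128 in this service)
        for (float f : value) {
            buf.putFloat(f);
        }
    }

    @Override
    public float[] deserialize(ByteBuffer buf) {
        int len = buf.getInt();
        float[] value = new float[len];
        for (int i = 0; i < len; i++) {
            value[i] = buf.getFloat();
        }
        return value;
    }

    @Override
    public int serializedSize(float[] value) {
        return Integer.BYTES + value.length * Float.BYTES;  // 4 + 128 * 4 bytes
    }
}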

Final effect

TP99 is now stable below 13 ms! Goodbye, spikes~

Finally, here is the total memory usage during a map replacement:

Thanks for reading!

Reproduced from: https://juejin.im/post/5cdf8df4f265da1bd260bae9
