ロブ・ペイズリー：

私は2番目〜975個のパケット/（長さが1038オクテット）でマルチキャストUDPを読み込みRHEL 7.4上で動作するJavaのOSGi（Apacheのフェリックス）アプリケーションがあります。その後、境界デバイス横切るシミュレートし、XMLにデータを変換して、UDPマルチキャストパケットに戻ってそれを変換します。複数のスレッド関与しており、それがシミュレート境界デバイスは1つのペイロードを処理するために時間がかかる場合は、それをバッファして、より大きなペイロード次回を送信するように書かれているがあります。

この統合テスト・シナリオを介してパケットのレイテンシを見ると、二つの異なるデスクトップグレードのマシンはかなり速く我々が展開することを期待かなり高いエンドサーバーよりもあります。

サーバー待ち時間5秒。HW：デュアルXeonプロセッサ[email protected]、128G RAM、16物理、21個の論理コア、RAID 1台のSASのSSD。
デスクトップA <1秒。HWのXeon [email protected]、64G RAM、4物理的、8個の論理コア、500G SSD
デスクトップB <1秒。HW [email protected]、16G RAM、4物理的、8個の論理コア、1TB 7200RPMドライブ。

このアプリケーションは、ディスクへの書き込みをしないように私は唯一の完全性のために、ハードドライブに言及します。紙の上にサーバが2デスクトップの速さとして少なくとも実行する必要があります。

私が解消しました物事：

ネットワークカード。私はNICの間に有意な差があります念の物理NICとダミーデバイスの両方でテストしてみました。
論理コアの数。私は、変数を排除するための努力で、サーバの論理コアの16と24を無効にすることを試みました。
Javaのバージョン。すべての3つは、OpenJDKのと同じバージョンのOracleのJava（Javaの両方がで試された1.8.0）同じ結果を得ました。
Javaのフラグは同一であり、全て（ディレクトリ、構成プロパティ、および実行するためのjarファイルをインストール）フェリックスに関する。
SELinuxの。私はすべての3つのモード（許可、強制、無効）でそれを試してみました。私はここでの違いを期待していなかったが、私はこの時点では何のために把握しています。
カーネルのバージョン。私は反対のテストを試してみた3.10.0、4.13.0と4.15.0同様の結果で。

ark.intel.comプロセッサの比較

ここでは、問題を説明するための2つのサンプルのグラフです。このテストでは、マルチキャストアドレスAに4分10秒を越え260960個のUDPパケットを送信し、それはアプリケーションを介して処理されていた後、パケットがB.は、マルチキャストアドレスに送信されtcpdump、両方のタイムスタンプを記録し、減算は、待ち時間が得られます。すべての3つのアプリケーション（送信者、アプリケーションは、tcpdump同じマシン上にあります）。

第1のダミーインターフェースに対するサーバーのハードウェア

ダミーインターフェースに対するi7のデスクトップハードウェア

Y軸スケールの違いに注意してください。デスクトップは0-1秒であるI7 Serverは、0-4秒です。読みにくい表示されるX軸はパケット数です。

次の試み

I was running a local integration verison of the application. I then eliminated almost 100% of the work begin done by the application and saw growing latencies on the server hardware. I then tried -Xmx100G -Xms100G essentially to keep the garbage collector from running EVER and saw the following results (< 1 second consistent latency).

巨大なメモリ

Which led me to Java 8's Available Garbage Collectors.

Default Garbage Collector select on the server hardware was New: ParallelScavenge, Old: ParallelOld. Here's the resulting latency graph without the XML conversion, as simple a test as I could make it to duplicate the issue.

デフォルトのGC

Explicitly selecting the Garbage First Garbage Collector -XX:+UseG1GC selected New: G1New, Old: G1Old and it's resulting latency graph wasn't great:

G1 GC

Explicitly selecting the Concurrent Mark Sweep Garbage Collector -XX:+UseConcMarkSweepGC selected New: ParNew, Old: ConcurrentMarkSweep and it's resulting latency graph looked excellent:

CMS GC

It appeared like the problem was solved. Once I added all the components back into place, I'm still getting unacceptable latencies. I'm still running tests to see if I can isolate the issue.

Strace Results

Trying strace -c -o /path/to/file -f yielded the following top system calls

First the i7's desktop strace report (truncated at the top 10 items)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.71 1418.604132         959   1479659    134352 futex
  1.74   26.294223      730395        36           poll
  1.74   26.288786         314     83645         4 read
  1.41   21.373672          73    293618           epoll_pwait
  1.19   17.952475         120    149854         2 recvfrom
  0.10    1.448453           2    909731           getrusage
  0.06    0.896903           3    281407           sendto
  0.03    0.394695           2    198041           write
  0.01    0.182809          10     18246           mmap
  0.01    0.120735           6     20582           sched_yield

Now for the server's strace report:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.46 2119.311196        2642    802183    131276 futex
  1.28   27.734136     6933534         4           poll
  0.59   12.840448          49    263597           epoll_wait
  0.41    8.885742         113     78387         2 recvfrom
  0.07    1.575401           6    263671           sendto
  0.07    1.515999           6    262256           epoll_ctl
  0.04    0.902788          54     16800           sched_yield
  0.03    0.743231          10     75455           write
  0.02    0.490052           6     84509         7 read
  0.01    0.170152           4     42732           lseek

I'm unclear what I should conclude from this. The desktop is many times faster in both the futex and the poll system call. I still don't understand why the application is so much more latent on the faster hardware.

Profiling

I've profiled the software on both pieces of hardware showing similar locations for hotspots which seems to rule that out.

Rob Paisley :

I confirmed I was using the performance CPU governor with RedHat: CPUfreq Coverners

I ran across a VMWare ESXi report of problematic BIOS settings Virtual Machine Application runs slower than expected on EXSi

Which pointed directly to my answer. The default on this Dell R630 was "Performance Per Watt (DAPC)" (DAPC: Dell Active Power Controller). Switching to "Performance" fixed this issue entirely. The machine felt much snappier at the console, and latencies were much lower than the desktop was able to achieve which was what I expected given the CPU differences.

Steps to change the BIOS on a Dell R630 (and likely others) on startup:

F2 to enter System Setup
Select "System BIOS"
Select "System Profile Settings"
Ensure first entry is set to "Performance" default is "Performance Per Watt"
Select "Back"
Select "Finish"
Select "Yes" to save changes with system reset
Select "OK" to the settings were saved successfully

Here's the resulting latency graph(s), they're using the same 1 second scale.

Default GC on the server(s): DefaultGCは

Concurrent Mark Sweep GC on the server(s): CMSGC

First Generation GC on the server(s): G1GC

Not much difference between the G1GC and the CMSGC, but both are clearly better latency than the default (which is expected).

Graphs of Logical Core Clock Speed

シンボルは見えにくいですが、これらの二つのグラフ上の32個の異なる点があります。全体的に、あなたはすぐにパフォーマンスがあった1伝えることができ、1つは、ワット当たり性能-DAPCがありました。

パフォーマンスパーワット（DAPC）： PerformancePerWattDAPC

演奏

一緒にプロットしました。赤の弾丸、パフォーマンスパーワットブルーオープンサークル内でのパフォーマンス

これは、それに応じてBIOSセットとデータフローの300秒の間に捕獲されました。ここで私は誰もが知りたい場合には、データをキャプチャ方法は次のとおりです。

for i in `seq 300`; do
  paste /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_cur_freq
  sleep 1
done > performance.log

どのようなソフトウェアは、より良いハードウェアに遅い実行する可能性がありますか？

次の試み

Strace Results

Profiling

Graphs of Logical Core Clock Speed

おすすめ