How to use the performance testing tool Lmbench and analyze the running results

1. Introduction to Lmbench

Lmbench is a simple, portable micro-benchmark suite for measuring system performance. It covers bandwidth evaluation (cached file reads, memory copy, memory read/write, pipes, TCP), latency evaluation (context switching, network connections, file system creation and deletion, process creation, signal handling, system call overhead, memory read latency), and other measurements.

2. Download and install

Official website address: http://www.bitmover.com/lmbench/
Download link: lmbench-3.0

imaginemiracle:Downloads$ unzip lmbench-3.0-a9.zip

Note that none of the files in the lmbench directory are executable after unpacking, so running make directly fails with a series of errors such as Permission denied.

Here you first need to change the permissions of all files:

imaginemiracle:Downloads$ sudo chmod 777 -R lmbench-3.0-a9/

Enter the lmbench directory. Its structure is as follows.

imaginemiracle:Downloads$ cd lmbench-3.0-a9/
imaginemiracle:lmbench-3.0-a9$ ls
ACKNOWLEDGEMENTS  CHANGES    COPYING    doc              Makefile  results  src
bin               ChangeSet  COPYING-2  hbench-REBUTTAL  README    scripts

3. Run the Lmbench tests

Run make results. You will then be prompted to set the following options:

  • MULTIPLE COPIES: the number of copies of the tests to run in parallel, reported as the scal load item in the results;
  • Job placement selection: how job scheduling is controlled; the default is 1, which allows job placement;
  • Options to control job placement: the default is 1;
  • Memory: set it to more than 4 times the cache size; the larger the value, the more accurate the results and the longer the run time;
  • SUBSET: the subset of tests to run, one of ALL / HARDWARE / OS / DEVELOPMENT; the default is ALL;
  • FASTMEM, SLOWFS, DISKS, REMOTE... and the other options can be left at their defaults.
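
As a rough illustration of the Memory rule above, the sketch below computes the smallest value that is strictly greater than 4 times the cache size. It assumes the prompt expects a value in megabytes and that you already know your cache size; the function name and the 8 MB example cache are hypothetical.

```python
def min_memory_mb(cache_bytes: int, factor: int = 4) -> int:
    """Return the smallest whole number of MB strictly greater than
    factor * cache_bytes (factor=4 follows the rule of thumb above)."""
    threshold_bytes = factor * cache_bytes
    # Integer division rounds down, so adding 1 guarantees "greater than",
    # even when the threshold is an exact multiple of 1 MB.
    return threshold_bytes // (1024 * 1024) + 1


# Hypothetical machine with an 8 MB cache (assumption for illustration)
print(min_memory_mb(8 * 1024 * 1024))  # prints 33
```

This is only a lower bound; as noted above, larger values trade longer run time for more accurate results.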

After the setup is complete, the test program starts to run. The run takes a long time, so wait patiently, or do something else for 10 minutes and then check back.

4. View the results

Run make see to view the results. If only two command lines appear, the results have been written to the summary.out file, which you can view directly with cat ./results/summary.out.

4.1. Basic system information

The output begins with the basic system parameters, where:

  • tlb: the number of pages in the translation lookaside buffer (TLB);
  • cache line bytes: the cache line size in bytes;
  • mem par: memory hierarchy parallelism;
  • scal load: the number of Lmbench copies run in parallel.

4.2. Processor Performance

All results in this section are in microseconds (us); smaller values mean better performance.

  • null call: the time required to execute getppid;
  • null I/O: t1 is the time to read one byte from /dev/zero and t2 is the time to write one byte to /dev/null; the result is the average of t1 and t2;
  • stat: the time required to stat a file (that is, to get the file's information);
  • open clos: the total time to open a file and then close it (excluding the time for reading directories and nodes);
  • slct TCP: the time to select on 100 file descriptors that are TCP network connections;
  • sig inst: the time to install a signal handler;
  • sig hndl: the time to handle a signal;
  • fork proc: the total time to fork an identical process and have that process exit;
  • exec proc: simulates how a shell works: the time for a new process to be forked and execute a new command;
  • sh proc: the time to fork a process and have it ask the system shell to find and run a new program.
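
To make the null call item concrete, here is a minimal sketch that times repeated getppid calls and reports the average in microseconds. It is not how Lmbench itself is implemented; the function name and iteration count are assumptions, and Python interpreter overhead dominates the figure, so the number is not comparable with Lmbench's.

```python
import os
import time


def time_null_call(iterations: int = 100_000) -> float:
    """Return the average time of one os.getppid() call in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getppid()  # cheap system call that does almost no work
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    print(f"null call: {time_null_call():.4f} us")
```

The same loop-and-average pattern applies to the other items in this list, with the timed operation swapped for stat, open/close, and so on.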

4.3. Mathematical operations

All results in this section are in nanoseconds (ns); smaller values mean better performance. The output contains four groups of results:

(1) Integer calculation
(2) Unsigned integer calculation
(3) Floating point calculation
(4) Double-precision floating-point calculation

4.4. Context switching

All results in this section are in microseconds (us); smaller values mean better performance.

Multiple processes are connected in a ring by Unix pipes. Each process reads a token from its own pipe, performs some work, and then writes the token to the next process.

The context switching time includes the time to switch processes plus the time to restore all of the process's state (including the cache state).

  • 2p/0k: the context switching time with 2 processes, each of size 0 (performing no work);
  • 2p/16k: the context switching time with 2 processes, each of size 16K (performing work);

The remaining test items follow the same pattern.
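
The ring described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Lmbench's implementation: the function name and parameters are hypothetical, each process does no work between reading and writing (the 0k case), and the reported figure includes pipe and interpreter overhead that Lmbench accounts for separately. It requires a Unix-like system because it uses os.fork.

```python
import os
import time


def ring_hop_us(nprocs: int = 2, laps: int = 1000) -> float:
    """Pass a 1-byte token around a ring of nprocs processes connected by
    pipes and return the average time per hop in microseconds."""
    # pipes[i] is the pipe that process i reads its token from.
    pipes = [os.pipe() for _ in range(nprocs)]
    children = []
    for i in range(1, nprocs):
        pid = os.fork()
        if pid == 0:
            try:
                rfd = pipes[i][0]                 # own pipe (read end)
                wfd = pipes[(i + 1) % nprocs][1]  # next process (write end)
                for r, w in pipes:                # close everything else
                    if r != rfd:
                        os.close(r)
                    if w != wfd:
                        os.close(w)
                while True:
                    token = os.read(rfd, 1)
                    if not token:                 # EOF: previous process is done
                        break
                    os.write(wfd, token)
            finally:
                os._exit(0)                       # never fall into parent code
        children.append(pid)

    # The parent acts as process 0 in the ring.
    rfd = pipes[0][0]
    wfd = pipes[1 % nprocs][1]
    for r, w in pipes:
        if r != rfd:
            os.close(r)
        if w != wfd:
            os.close(w)

    start = time.perf_counter()
    for _ in range(laps):
        os.write(wfd, b"t")   # send the token to the next process
        os.read(rfd, 1)       # wait until it has travelled the whole ring
    elapsed = time.perf_counter() - start

    os.close(wfd)             # EOF cascades around the ring, children exit
    os.close(rfd)
    for pid in children:
        os.waitpid(pid, 0)
    return elapsed / (laps * nprocs) * 1e6


if __name__ == "__main__":
    print(f"2 processes: {ring_hop_us(2):.2f} us per hop")
```

Each lap forces one hand-off per process, which is why the elapsed time is divided by laps * nprocs.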

4.5. Local communication delay

All results in this section are in microseconds (us); smaller values mean better performance.

  • 2p/0k: the context switching time with 2 processes, each of size 0 (performing no work);
  • Pipe: the so-called hot potato test: two processes with no other work communicate over a pipe, passing one token back and forth, and the result is the average round-trip time;
  • AF UNIX: the same as the Pipe test, but the processes communicate over a socket;
  • UDP: the same as the Pipe test, but the processes communicate over UDP/IP;
  • RPC/UDP: the same as the Pipe test, but the processes communicate using Sun RPC, which by default uses UDP as the transport;
  • TCP: the same as the Pipe test, but the processes communicate over TCP/IP;
  • RPC/TCP: the same as the Pipe test, but the processes communicate using Sun RPC with TCP specified as the transport;
  • TCP conn: the time it takes to create a socket descriptor and establish a connection.
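
As a sketch of the hot potato pattern used by the AF UNIX item, the code below bounces a one-byte token between two processes over a socket pair and averages the round-trip time. The function name and round-trip count are assumptions for illustration, the code needs a Unix-like system (os.fork), and Python overhead means the result only shows the shape of the measurement.

```python
import os
import socket
import time


def af_unix_round_trip_us(round_trips: int = 1000) -> float:
    """Bounce a 1-byte token between two processes over a socket pair and
    return the average round-trip time in microseconds."""
    # On Unix, socketpair() defaults to a connected AF_UNIX stream pair.
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        try:
            parent_sock.close()
            while True:
                token = child_sock.recv(1)
                if not token:          # peer closed: stop echoing
                    break
                child_sock.sendall(token)
        finally:
            os._exit(0)                # never fall into parent code

    child_sock.close()
    start = time.perf_counter()
    for _ in range(round_trips):
        parent_sock.sendall(b"t")      # throw the potato
        parent_sock.recv(1)            # wait for it to come back
    elapsed = time.perf_counter() - start

    parent_sock.close()                # child sees EOF and exits
    os.waitpid(pid, 0)
    return elapsed / round_trips * 1e6


if __name__ == "__main__":
    print(f"AF UNIX: {af_unix_round_trip_us():.2f} us per round trip")
```

Swapping the socket pair for a pipe pair, a UDP socket, or a TCP connection gives the corresponding variants from the list above.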

4.6. File and memory delay

All results in this section are in microseconds (us); smaller values mean better performance.

  • 0K File Create: the time to create a 0K file;
  • 0K File Delete: the time to delete a 0K file;
  • 10K File Create: the time to create a 10K file;
  • 10K File Delete: the time to delete a 10K file;
  • Mmap Latency: mmap a region of size n into memory and then unmap it; the total time of each mmap and unmap is recorded, and the maximum of these times is reported;
  • Prot Fault: protection fault latency;
  • Page Fault: page fault latency;
  • 100fd selct: the time to select on 100 file descriptors.
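
The 100fd selct item can be illustrated with a short sketch that calls select on 100 descriptors and averages the time per call. Using duplicates of a /dev/null descriptor is an assumption made to keep the example self-contained, and the function name and defaults are hypothetical; it is not taken from Lmbench's source.

```python
import os
import select
import time


def select_fds_us(nfds: int = 100, iterations: int = 1000) -> float:
    """Return the average time of one select() call over nfds file
    descriptors, in microseconds."""
    first = os.open(os.devnull, os.O_RDONLY)
    # Duplicate the descriptor so select() has nfds entries to scan.
    fds = [first] + [os.dup(first) for _ in range(nfds - 1)]
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            select.select(fds, [], [], 0)  # timeout 0: poll, do not block
        elapsed = time.perf_counter() - start
    finally:
        for fd in fds:
            os.close(fd)
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    print(f"100fd select: {select_fds_us():.2f} us")
```

The cost grows with the number of descriptors because each call has to scan the whole set.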

4.7. Local communication bandwidth

All results in this section are in MB/s; larger values mean better performance.

  • Pipe: a pipe is established between two processes, and 50MB of data is moved through it in 64K chunks;
  • AF UNIX: a Unix stream socket connection is established, and 10MB of data is transmitted through it in 64K chunks;
  • TCP: the same as the Pipe test, but the processes communicate over a TCP/IP socket, and 3MB of data is transmitted;
  • File reread: reads a file and sums up its contents;
  • Mmap reread: maps a file into memory with mmap, then reads it from memory and sums up its contents;
  • Bcopy(libc): runs bw_mem $i bcopy, which measures the speed of copying a specified number of bytes from one memory region to another;
  • Bcopy(hand): runs bw_mem $i fcp, which measures the speed of copying data from one memory location to another with a hand-written copy loop;
  • Mem read: runs bw_mem $i frd, which sums the integer values in an array to test the bandwidth of reading data into the processor;
  • Mem write: runs bw_mem $i fwr, which sets each element of an integer array to 1 to test the bandwidth of writing data to memory.
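
To show how a bandwidth figure in MB/s is derived from a copy, the sketch below times a bulk copy between two buffers and divides the amount copied by the best observed time. It is only an analogy to the Bcopy items: the function name, buffer size, and repeat count are assumptions, and the copy is whatever Python's bytearray slice assignment does internally, not bw_mem.

```python
import time


def copy_bandwidth_mb_s(size_bytes: int = 64 * 1024 * 1024,
                        repeats: int = 5) -> float:
    """Copy size_bytes from one buffer to another several times and return
    the best observed bandwidth in MB/s."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: one bulk copy
        best = min(best, time.perf_counter() - start)
    return size_bytes / (1024 * 1024) / best


if __name__ == "__main__":
    print(f"copy: {copy_bandwidth_mb_s():.0f} MB/s")
```

Taking the best of several runs reduces the influence of one-off effects such as the first touch of freshly allocated pages.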

4.8. Memory Operation Latency

All results in this section are in nanoseconds (ns); smaller values mean better performance.

The test runs lat_mem_rd locally, summing the value of every 4th element in an integer array; it measures the latency of reading data into the processor.

  • L1: level 1 cache latency;
  • L2: level 2 cache latency;
  • Main Mem: sequential main memory access latency;
  • Rand Mem: random memory access latency;
  • Guesses:
    If L1 and L2 are similar, "No L1 cache?" is displayed.
    If L2 and Main Mem are similar, "No L2 cache?" is displayed.


Origin blog.csdn.net/qq_36393978/article/details/125989992