Detailed explanation of mmap for Linux memory management


One. mmap system call

1.  System call mmap()

    mmap() maps a file or other object into memory. A mapping always covers a whole number of pages. If the file size is not a multiple of the page size, the unused remainder of the last page is zero-filled. munmap() does the opposite: it removes the mappings for a given address range.
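The zero-filled tail of the last page can be observed directly. The following is a minimal sketch, assuming a Linux system with a writable /tmp; the function name and the temp-file name are made up for illustration.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if every byte between the end of a 5-byte file and the end of
 * its single mapped page reads as zero, 0 if not, -1 on error. */
int tail_of_last_page_is_zero(void)
{
    char path[] = "/tmp/mmap_tail_XXXXXX";   /* hypothetical temp-file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives until fd is closed */

    if (write(fd, "hello", 5) != 5) { close(fd); return -1; }

    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, 5, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                               /* the mapping keeps the file alive */
    if (p == MAP_FAILED)
        return -1;

    int ok = 1;
    for (long i = 5; i < page; i++)          /* bytes past EOF, same page */
        if (p[i] != 0) { ok = 0; break; }

    munmap(p, 5);
    return ok;
}
```

Although only 5 bytes are requested, the kernel maps a full page, which is why reading up to the page boundary is valid here.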

Once a file is mapped into the process, you can read and write it directly through the returned virtual address, without calling system calls such as read() and write(). Note, however, that writing directly to this memory does not extend the file: content written beyond the current file size is not written to the file.
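A minimal sketch of both points, assuming a Linux system with a writable /tmp (the function and file names are hypothetical): the file is modified with ordinary stores, and a byte stored past the end of the file does not make the file longer.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes "hello" to a temp file, upper-cases it through a shared mapping,
 * then reads the file back with pread(). Returns 0 if the file now holds
 * "HELLO" and is still 5 bytes long, -1 otherwise. */
int edit_file_through_mapping(void)
{
    char path[] = "/tmp/mmap_edit_XXXXXX";   /* hypothetical temp-file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "hello", 5) != 5) { close(fd); return -1; }

    char *p = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    for (int i = 0; i < 5; i++)
        p[i] = (char)toupper((unsigned char)p[i]);   /* plain stores, no write() */
    p[5] = '!';            /* past EOF: stays in the page, never reaches the file */

    if (msync(p, 5, MS_SYNC) != 0 || munmap(p, 5) != 0) { close(fd); return -1; }

    char buf[16] = {0};
    ssize_t n = pread(fd, buf, sizeof buf, 0);
    close(fd);
    return (n == 5 && memcmp(buf, "HELLO", 5) == 0) ? 0 : -1;
}
```

To grow a file that is going to be written through a mapping, extend it first (for example with ftruncate()) and map the new length.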

An obvious benefit of communicating through shared memory is efficiency, because processes read and write the memory directly without extra copying. Communication methods such as pipes and message queues need four copies of the data between kernel and user space, whereas shared memory needs only two: one from the input file into the shared memory area, and one from the shared memory area to the output file. In practice, processes that share memory do not unmap the area after every small read or write and then re-establish it for the next exchange. Instead, the shared area is kept until communication is complete, so the data stays in shared memory rather than being written back to the file each time. The contents of shared memory are typically written back to the file when the mapping is removed. This is why communication through shared memory is very efficient.

    For file-backed mappings, the st_atime of the mapped file may be updated at any time between the mmap() call and the corresponding munmap(). If st_atime has not been updated by then, it is updated the first time a page of the mapped area is referenced. For file mappings established with PROT_WRITE and MAP_SHARED, st_ctime and st_mtime are updated at some point after the mapped area is written and before the next msync() call made with the MS_SYNC or MS_ASYNC flag.

Usage:

#include <sys/mman.h>

void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);

int munmap(void *start, size_t length);

Return description:

On success, mmap() returns a pointer to the mapped area and munmap() returns 0. On failure, mmap() returns MAP_FAILED (that is, (void *)-1) and munmap() returns -1. In both cases errno is set to one of the following values:

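EACCES: Access denied, for example fd is not open for reading, or MAP_SHARED with PROT_WRITE was requested but fd is not open for reading and writing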

EAGAIN: The file is locked, or too much memory is locked

EBADF: fd is not a valid file descriptor

EINVAL: One or more arguments are invalid

ENFILE: The system's limit on open files has been reached

ENODEV: The file system where the specified file is located does not support memory mapping

ENOMEM: Out of memory, or the process has exceeded the maximum number of memory maps

EPERM: Insufficient capability, operation not allowed

ETXTBSY: MAP_DENYWRITE was specified, but the file referred to by fd is open for writing

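In addition, using a mapped area can raise the following signals (these are signals, not errno values):

SIGSEGV: Attempt to write to a read-only area

SIGBUS: Attempt to access a part of the mapped area that does not correspond to the file, for example beyond the end of the file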

Parameters:

start: The preferred start address of the mapping area. It is usually NULL, which lets the kernel choose the address.


length: The length of the mapping area.


prot: The desired memory protection, which must not conflict with the mode in which the file was opened. It is either PROT_NONE or the bitwise OR of one or more of the other values below:

PROT_EXEC //Page content can be executed

PROT_READ //Page content can be read

PROT_WRITE //Page can be written

PROT_NONE //The page is not accessible
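The rule that prot must not conflict with the file's open mode can be checked with a short experiment. This is a sketch under the assumption of a Linux system with a writable /tmp; the function and file names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Opens a temp file read-only and requests a writable mapping of it.
 * `flags` is MAP_SHARED or MAP_PRIVATE. Returns 0 if mmap() succeeds,
 * the errno value if it fails, or -1 if the setup itself fails. */
int writable_map_of_readonly_fd(int flags)
{
    char path[] = "/tmp/mmap_prot_XXXXXX";   /* hypothetical temp-file name */
    int wfd = mkstemp(path);
    if (wfd < 0)
        return -1;
    if (write(wfd, "data", 4) != 4) { close(wfd); unlink(path); return -1; }
    close(wfd);

    int fd = open(path, O_RDONLY);           /* reopen without write access */
    unlink(path);
    if (fd < 0)
        return -1;

    int result = 0;
    void *p = mmap(NULL, 4, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED)
        result = errno;
    else
        munmap(p, 4);
    close(fd);
    return result;
}
```

With MAP_SHARED the request conflicts with O_RDONLY and fails with EACCES. With MAP_PRIVATE it succeeds, because writes go to private copies of the pages and never reach the file.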


flags: Specifies the type of the mapped object, mapping options, and whether the mapped pages can be shared. Its value is a combination of one or more of the following bits:

MAP_FIXED //Use exactly the specified start address. If the memory area given by start and length overlaps an existing mapping, the overlapping part of the existing mapping is discarded. If the specified address cannot be used, the call fails. The start address must fall on a page boundary.

MAP_SHARED //Share the mapping with all other processes that map this object. Writing to the shared area is equivalent to writing to the file, but the file may not actually be updated until msync() or munmap() is called.

MAP_PRIVATE //Create a private copy-on-write mapping. Writes to the memory area do not affect the original file. MAP_SHARED and MAP_PRIVATE are mutually exclusive; exactly one of them must be specified.

MAP_DENYWRITE //This flag is ignored.

MAP_EXECUTABLE //This flag is also ignored.

MAP_NORESERVE //Do not reserve swap space for this mapping. When swap space is reserved, modifications to the mapped area are guaranteed to succeed. When it is not reserved and memory runs out, a write to the mapped area may raise SIGSEGV.

MAP_LOCKED //Lock the pages of the mapped area to prevent pages from being swapped out of memory.

MAP_GROWSDOWN //For the stack, it tells the kernel VM system that the mapped area can be extended downwards.

MAP_ANONYMOUS //Anonymous mapping, the mapping area is not associated with any file.

MAP_ANON //Deprecated synonym for MAP_ANONYMOUS.

MAP_FILE // Compatibility flag, ignored.

MAP_32BIT // Put the mapping area in the lower 2GB of the process address space, which will be ignored when MAP_FIXED is specified. Currently this flag is only supported on x86-64 platforms.

MAP_POPULATE //Populate the page tables for the mapping in advance; for a file mapping this reads the file ahead. Subsequent accesses to the mapped area are then not blocked by page faults.

MAP_NONBLOCK // Only meaningful when used with MAP_POPULATE. No read-ahead is performed, only page table entries are created for pages that already exist in memory.
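The difference between MAP_SHARED and MAP_PRIVATE is easiest to see by writing through each kind of mapping and then reading the file with an ordinary system call. A minimal sketch, assuming Linux and a writable /tmp (function and file names are hypothetical):

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes 'X' over the first byte of a temp file through a mapping created
 * with `flags` (MAP_SHARED or MAP_PRIVATE) and returns the first byte that
 * pread() then sees in the file, or -1 on error. */
int first_byte_after_mapped_write(int flags)
{
    char path[] = "/tmp/mmap_cow_XXXXXX";    /* hypothetical temp-file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "abc", 3) != 3) { close(fd); return -1; }

    char *p = mmap(NULL, 3, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    p[0] = 'X';                              /* shared: changes the file's page;
                                                private: changes a private copy */
    munmap(p, 3);

    char c = 0;
    ssize_t n = pread(fd, &c, 1, 0);
    close(fd);
    return n == 1 ? c : -1;
}
```

With MAP_SHARED the file's first byte becomes 'X'; with MAP_PRIVATE it remains 'a'.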


fd: A valid file descriptor. If MAP_ANONYMOUS is set, fd is ignored, but it should be -1 for portability.


offset: The offset within the mapped object at which the mapping starts. It must be a multiple of the page size.
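Because offset has to be page-aligned, mapping data that starts at an arbitrary file position means rounding the offset down and skipping the extra bytes at the front. The following sketch shows one way to do that; struct file_window, map_window() and the demo function are hypothetical helpers, not part of any library.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* `data` points at the requested byte; `base` and `maplen` are what must
 * later be passed to munmap(). */
struct file_window {
    void   *base;
    size_t  maplen;
    char   *data;
};

/* Maps `len` bytes of fd starting at any offset. Returns 0 or -1. */
int map_window(int fd, off_t offset, size_t len, struct file_window *w)
{
    off_t page = (off_t)sysconf(_SC_PAGESIZE);
    off_t aligned = offset - (offset % page);     /* round down to a page boundary */
    size_t slack = (size_t)(offset - aligned);    /* extra bytes mapped in front */

    void *p = mmap(NULL, len + slack, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (p == MAP_FAILED)
        return -1;
    w->base = p;
    w->maplen = len + slack;
    w->data = (char *)p + slack;
    return 0;
}

/* Demo: store 'Z' one page plus 10 bytes into a temp file and read it back
 * through a window starting at that unaligned offset. Returns the byte
 * read, or -1 on error. */
int demo_read_at_unaligned_offset(void)
{
    char path[] = "/tmp/mmap_win_XXXXXX";         /* hypothetical temp-file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    off_t where = (off_t)sysconf(_SC_PAGESIZE) + 10;
    struct file_window w;
    int c = -1;
    if (pwrite(fd, "Z", 1, where) == 1 && map_window(fd, where, 1, &w) == 0) {
        c = w.data[0];
        munmap(w.base, w.maplen);
    }
    close(fd);
    return c;
}
```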


2.  System call munmap() 

#include <sys/mman.h>


int munmap(void *addr, size_t len);

This call removes a mapping from the process address space. addr is the address returned by mmap(), and len is the size of the mapped area. Once the mapping has been removed, any access to the formerly mapped addresses causes a segmentation fault.


3.  System call msync() 

#include <sys/mman.h>


int msync(void *addr, size_t len, int flags);

Generally speaking, changes that a process makes to shared content in the mapped area are not written straight back to the disk file; the write-back often happens only after munmap() is called. Calling msync() makes the file content on disk consistent with the content of the shared memory area.
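msync() requires a page-aligned start address, so flushing a small object inside a mapping usually means rounding its address down first. A minimal sketch, assuming Linux and a writable /tmp; sync_range() and the demo function are hypothetical helpers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Rounds the start address down to a page boundary, grows the length by
 * the same amount, and flushes synchronously. Returns msync()'s result. */
int sync_range(void *addr, size_t len)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(page - 1);
    size_t slack = (size_t)((uintptr_t)addr - start);
    return msync((void *)start, len + slack, MS_SYNC);
}

/* Demo: change one byte in the middle of a page and flush just that range.
 * Returns the byte read back from the file, or -1 on error. */
int demo_sync_one_byte(void)
{
    char path[] = "/tmp/mmap_sync_XXXXXX";   /* hypothetical temp-file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 4096) != 0) { close(fd); return -1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    p[100] = 'Q';

    char c = 0;
    int ok = sync_range(p + 100, 1) == 0 && pread(fd, &c, 1, 100) == 1;
    munmap(p, 4096);
    close(fd);
    return ok ? c : -1;
}
```

MS_SYNC waits until the write-back has completed; MS_ASYNC only schedules it.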
 


Two. Two ways of using the mmap() system call for shared memory

(1) Use a memory mapping backed by an ordinary file. This works between any processes. You open or create a file and then call mmap(); the typical calling code is as follows:
 
  fd = open(name, flag, mode);
  if (fd < 0)
     ...
  ptr = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

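Communicating through a file-backed shared mapping has several properties and caveats worth noting. For example, the file must be at least as long as the mapped length, because accessing pages beyond the end of the file raises SIGBUS.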
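A slightly fuller version of the call sequence above might look like the following sketch. map_shared_file() and the demo function are hypothetical helpers; two independent mappings inside one process stand in for two unrelated processes that map the same file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Opens (creating if needed) `name`, makes sure it is at least `len` bytes
 * long, and maps it MAP_SHARED. Returns the mapping or MAP_FAILED. */
void *map_shared_file(const char *name, size_t len)
{
    int fd = open(name, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;
    struct stat st;
    if (fstat(fd, &st) != 0 ||
        (st.st_size < (off_t)len && ftruncate(fd, (off_t)len) != 0)) {
        close(fd);
        return MAP_FAILED;
    }
    void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return ptr;
}

/* Demo: returns 0 if the second mapping sees what the first one wrote. */
int demo_two_mappings(const char *name)
{
    char *a = map_shared_file(name, 4096);
    char *b = map_shared_file(name, 4096);
    int rc = -1;
    if (a != MAP_FAILED && b != MAP_FAILED) {
        strcpy(a, "ping");                     /* "writer" side */
        rc = strcmp(b, "ping") == 0 ? 0 : -1;  /* "reader" side */
    }
    if (a != MAP_FAILED) munmap(a, 4096);
    if (b != MAP_FAILED) munmap(b, 4096);
    unlink(name);
    return rc;
}
```

Real cooperating processes would also need some form of synchronization (for example a semaphore) so that the reader knows when the data is complete; that part is omitted here.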

(2) Use an anonymous memory mapping, provided by a special file. This works between related processes. Because of the parent-child relationship, the parent calls mmap() first and then fork(). After fork(), the child inherits the parent's anonymously mapped address space, including the address returned by mmap(), so parent and child can communicate through the mapped area. Note that this is not ordinary inheritance: normally a child keeps its own independent copies of the variables it inherits from the parent, whereas the area returned by mmap() is shared and maintained jointly by both processes. Anonymous memory mapping is the best way to implement shared memory between related processes: no particular file has to be specified, only the corresponding flag has to be set.


Three.  How mmap implements memory mapping

     The ultimate purpose of the mmap system call is to map the device or file to the virtual address space of the user process, so that the user process can directly read and write the file. This task can be divided into the following three steps:

1. Find a free, contiguous range of virtual addresses in the user virtual address space that meets the requirements of the mapping (done by the kernel's mmap system call)

       On 32-bit x86 Linux, each process has 3 GB of user virtual address space. This does not mean that a user process can use any address within those 3 GB freely, because virtual memory must eventually be mapped to physical storage (memory or disk space) before it can really be used.

       So how does the kernel manage the 3 GB of virtual address space of each process? In a nutshell, the executable image produced by compiling and linking a program has a code segment and a data segment (the data segment proper plus the bss segment), with the code segment at the bottom and the data segment above it. The data segment holds all statically allocated data, that is, global variables and all local variables declared static. This space is a basic requirement of the process and is allocated when the running image of the process is created. The stack is also a basic requirement, so it too is allocated when the process is created, as shown in Figure 3.1:

 

 

 Figure 3.1 Division of process virtual space

      In the kernel, each area is represented by a struct vm_area_struct. It describes a contiguous range of virtual memory with the same access attributes, and the size of the range is an integer multiple of the page size. You can run cat /proc/<pid>/maps to view the memory layout of a process, where <pid> is the process ID. Each line displayed corresponds to one vm_area_struct of the process.

The following is the definition of struct vm_area_struct as it appeared in early Linux 2.4 kernels (later kernels have changed the structure considerably):

  #include <linux/mm_types.h>

  /* This struct defines a memory VMM memory area. */

  struct vm_area_struct {
      struct mm_struct * vm_mm; /* VM area parameters */
      unsigned long vm_start;
      unsigned long vm_end;

      /* linked list of VM areas per task, sorted by address */
      struct vm_area_struct *vm_next;
      pgprot_t vm_page_prot;
      unsigned long vm_flags;

      /* AVL tree of VM areas per task, sorted by address */
      short vm_avl_height;
      struct vm_area_struct * vm_avl_left;
      struct vm_area_struct * vm_avl_right;

      /* For areas with an address space and backing store */
      struct vm_area_struct *vm_next_share;
      struct vm_area_struct **vm_pprev_share;
      struct vm_operations_struct * vm_ops;
      unsigned long vm_pgoff; /* offset in PAGE_SIZE units, *not* PAGE_CACHE_SIZE */
      struct file * vm_file;
      unsigned long vm_raend;
      void * vm_private_data; /* was vm_pte (shared mem) */
  };

      The virtual memory used by a process is usually not contiguous, and different parts of it may have different access attributes. The virtual address space of a process therefore has to be described by multiple vm_area_struct structures. When there are only a few of them, they are kept sorted by ascending address in a singly linked list (each one points to the next through its vm_next pointer). When there are many vm_area_struct structures, however, searching a linked list becomes slow. To address this, vm_area_struct has three additional members, vm_avl_height (tree height), vm_avl_left (left child), and vm_avl_right (right child), which organize the structures into an AVL tree and speed up lookups.
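The linked-list lookup can be illustrated with a user-space toy model. struct toy_vma and both functions are hypothetical stand-ins written for this example; they only mimic the search rule of the kernel's find_vma(), which returns the first area whose end lies above the address.

```c
#include <stddef.h>

/* Cut-down stand-in for vm_area_struct: just the range and the list link. */
struct toy_vma {
    unsigned long vm_start;   /* first address inside the area */
    unsigned long vm_end;     /* first address after the area */
    struct toy_vma *vm_next;  /* next area, in ascending address order */
};

/* Returns the first area with addr < vm_end, or NULL if there is none.
 * The result may start above addr, which means addr lies in a hole. */
struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    for (struct toy_vma *v = head; v != NULL; v = v->vm_next)
        if (addr < v->vm_end)
            return v;
    return NULL;
}

/* Returns 1 if addr is actually inside some area of the list. */
int toy_addr_is_mapped(struct toy_vma *head, unsigned long addr)
{
    struct toy_vma *v = toy_find_vma(head, addr);
    return v != NULL && addr >= v->vm_start;
}
```

The walk is linear in the number of areas, which is the cost the tree members are meant to avoid.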

  If a vm_area_struct describes virtual memory that maps a file, its vm_file member points to the file structure of the mapped file, and vm_pgoff is the offset within that file, in units of pages, at which the start address of the area is mapped.

Figure 3.2 Schematic diagram of process virtual address 

Therefore, the work done by the mmap system call is to prepare such a range of virtual memory, create a vm_area_struct for it, and pass that structure to the mmap method of the specific device driver.

2. Establish the mapping between the virtual address space and the physical address of the file or device (completed by the device driver)

  The second step in establishing a file mapping is to map the virtual addresses to specific physical addresses, which is done by modifying the process page tables. The mmap method is a member of the file_operations structure:

  int (*mmap)(struct file *,struct vm_area_struct *);


There are two ways to build page tables in Linux:

(1) Use remap_pfn_range to build all page tables at once.

   int remap_pfn_range(struct vm_area_struct *vma, unsigned long virt_addr, unsigned long pfn, unsigned long size, pgprot_t prot); 

Return value:

Returns 0 on success and a negative error code on failure.

Parameter description:

vma: The virtual memory area into which the page range is being mapped.


virt_addr: The user virtual address where remapping should begin. The function builds page tables for the virtual address range from virt_addr to virt_addr + size.


pfn: The page frame number corresponding to the physical address to which the virtual address should be mapped. The page frame number is simply the physical address shifted right by PAGE_SHIFT bits. For most uses, the vm_pgoff member of the VMA structure contains exactly the value you need. The function affects physical addresses from (pfn << PAGE_SHIFT) to (pfn << PAGE_SHIFT) + size.


size: The size of the area being remapped, in bytes.


prot: The "protection" required for the new VMA. The driver can (and should) use the value found in vma->vm_page_prot.
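The relationship between a physical address, its page frame number, and the physical range touched by a remap is plain shift arithmetic. The following user-space sketch only illustrates that arithmetic; it does not call any kernel function, the names are hypothetical, and the 4 KiB page size (shift of 12) is an assumption.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Physical byte range [first, last) covered by a remap of `size` bytes
 * that starts at page frame number `pfn`. */
struct phys_range {
    uint64_t first;
    uint64_t last;
};

/* Page frame number of a physical address: shift right by the page shift. */
uint64_t toy_phys_to_pfn(uint64_t phys_addr)
{
    return phys_addr >> TOY_PAGE_SHIFT;
}

/* Physical range from (pfn << shift) up to (pfn << shift) + size. */
struct phys_range toy_remap_range(uint64_t pfn, uint64_t size)
{
    struct phys_range r;
    r.first = pfn << TOY_PAGE_SHIFT;
    r.last = r.first + size;
    return r;
}
```

For example, physical address 0xA0000 (640 KB) has page frame number 0xA0, and remapping 0x60000 bytes from there covers physical addresses up to 0x100000 (1 MB).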

(2) Use the nopage VMA method to build the page tables one page at a time, as page faults occur.

   struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);

Return value:

Returns a valid mapped page on success, NULL on failure.

Parameter description:

address: The user-space virtual address that caused the fault.


(3) Restrictions on use:

remap_pfn_range cannot map conventional memory. It gives access only to reserved pages and to physical addresses above the top of physical memory, because these are not managed by the subsystems of the memory manager. The range between 640 KB and 1 MB is marked as reserved and can therefore be mapped, as can device I/O memory. To map memory obtained from kmalloc() into user space, first mark the corresponding pages as reserved with mem_map_reserve().

3.  What happens when a newly mapped page is actually accessed (done by the page fault handler)

(1) How pages in the page cache and swap cache are identified: the physical pages of an accessed file all reside in the page cache or the swap cache, and all information about a page is described by a struct page. struct page has a pointer field named mapping, which points to a struct address_space. Every page in the page cache or swap cache is identified by its address_space structure together with an offset.
 
(2) Correspondence between a file and its address_space structure: when a file is opened, the kernel creates a struct inode for it in memory, and the inode's i_mapping field points to an address_space structure. A file thus corresponds to one address_space structure, and an address_space plus an offset identifies one page in the page cache or swap cache. When some piece of data has to be located, the corresponding page is therefore easy to find from the file and the offset of the data within the file.

(3) When a process calls mmap(), the kernel only reserves a region of the corresponding size in the process address space and sets the corresponding access flags. It does not yet establish any mapping from that region to physical pages. The first access to the region therefore triggers a page fault.


(4) For a shared memory mapping, the page fault handler first looks for the target page in the swap cache (a physical page matching the address_space and offset). If the page is found, its address is returned directly. If not, the handler checks whether the page is in the swap area and, if so, swaps it in. If neither is the case, the handler allocates a new physical page and inserts it into the page cache. In every case the process page table is finally updated.

      Note: When an ordinary file is mapped (a non-shared mapping), the page fault handler first looks for the page in the page cache using the address_space and the data offset. If the page is not found, the file data has not yet been read into memory, so the handler reads the corresponding page from disk and returns its address; the process page table is updated as well.

(5) The same applies when several processes map the same shared memory area. Once the mapping between linear addresses and physical addresses has been established, every process actually accesses the same physical pages of the shared memory area, regardless of the address that mmap() returned in each process.


From: http://blog.chinaunix.net/uid-26669729-id-3077015.html
