[Linux] In-depth understanding of the buffer

Table of contents

What is a buffer

Why is there a buffer

Buffer flush strategy

Where is the buffer

Manually design a user-space buffer


What is a buffer

A buffer is essentially an area of memory used to hold temporary data. Buffers are used in a wide variety of computing tasks, including input/output operations, network communication, image processing, audio processing, and more.

Who provides this memory area, and where does the buffer live? Read on to find out.

To give the answer up front: for the file I/O discussed here, it is provided by the C standard library.

Why is there a buffer

A buffer solves the problem of mismatched or unstable data transfer speeds and improves the efficiency of data processing.

Memory is fast and a hard disk is comparatively slow. If a program went to the disk for every small piece of data it needed, it would spend most of its time waiting on the device, and the disk would become a performance bottleneck. To alleviate this, a buffer is introduced: a larger chunk of data is transferred between the disk and the buffer in one operation, and the program then reads from or writes to the buffer in small pieces at memory speed.

Here is an example to illustrate:

Suppose you and a friend attend two different universities about 500 kilometers apart, and one day you want to send your friend some books. You could ride a bicycle and deliver them in person. The gift is light but the affection is heavy. Even with breaks along the way, the trip takes about a week because a bicycle is slow, and riding back to your own school takes another week. Two weeks in total. It takes too long.

Being smart, you could take the high-speed train and deliver the books directly instead. But the round trip costs more than 500, which is more than the books are worth. The cost is too high.

Think of the books as resources. Delivering them yourself, immediately and one trip at a time, is called write-through mode.

Then you realize you can send the books by express delivery. It is cheap and the parcel arrives in two or three days, so you hand the books to SF Express. Once the courier confirms that it has received the books, you can tell your friend they are on the way and get on with your own business. From your point of view the resources have been handed over. The role SF Express plays here is that of the buffer.

SF Express does not set off the moment it receives your parcel. It waits until enough parcels have accumulated and then ships them together, which is the equivalent of a buffer flush strategy.

Buffer flush strategy

There are three main flush strategies:

1. Flush immediately (unbuffered): data is written out as soon as it arrives.

2. Line flush (line buffering): the buffer is flushed when a \n is encountered.

3. Full flush (full buffering): data accumulates in the buffer and is written out only when the buffer is full.

There are also two special cases:

1. The user forces a flush (fflush)

2. The process exits normally

In either case the data in the buffer is flushed immediately, regardless of the strategy in effect.

So the overall flush behavior = general strategy + special cases.


Generally speaking:

Line buffering is used for the display (the terminal).

Full buffering is used for disk files.

From the point of view of efficiency, every device would prefer full buffering: the buffer is flushed only when it is full --> fewer I/O operations are required --> fewer accesses to the peripheral (which amounts to improving the efficiency of the whole machine).

Some readers may have doubts here. Take 10 lines of data, 100 bytes each. If the 10 lines are flushed together at the end, there is only one peripheral access, but it carries a lot of data: 1000 bytes. If they are flushed line by line, there are 10 accesses, but each one carries only a small amount of data. So why is it better to have as few peripheral accesses as possible?

Because when doing I/O with an external device, the amount of data is not the dominant cost. Preparing each I/O operation with the peripheral is what takes the most time.

It is the same as borrowing money from someone: the conversation beforehand often takes a long time, while the transfer itself takes only a few seconds.

So why not use full buffering everywhere, if it is the most efficient? Why do we need line buffering at all?

In fact, each strategy is a compromise based on the actual situation:

Line buffering is aimed at the display, whose output is meant to be read by a person. Efficiency matters, but so does the user experience: the user should see each line as soon as it is complete.

Ordinary text files that we write to are fully buffered. Nobody is watching them in real time, so the data can be saved once the buffer fills up or the user has finished writing.

With these buffers and strategies, the efficiency of data processing can be improved.

Where is the buffer

To answer this question, let's first write the following code:

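#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main()
{
    //interfaces provided by the C library
    printf("hello,printf\n");
    fprintf(stdout,"hello,fprintf\n");
    const char* s = "hello,fputs\n";
    fputs(s,stdout);

    //system call interface
    const char* ss = "hello,write\n";
    write(1,ss,strlen(ss));
    //note: a child process is created here, at the very end
    fork();
    return 0;
}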

We put fork, which creates a child process, at the very end. What is the point? The child has nothing left to execute after it.

Let's run the code first:

Run directly in a terminal, all four messages are printed once each, without any problems.

Now redirect the output to a file named log.txt and use cat to look at the contents of the file:

Something strange happens: the message written through the system call interface appears only once, while every message written through the functions provided by the C library appears twice.

What is the reason?


Recall the flush strategies described above:

When the program runs directly, it prints to the display and stdout is line buffered. When the output is redirected, it prints to a file and stdout becomes fully buffered.

1. When printing to the display, the line flush strategy applies. Every message ends in \n, so by the time fork executes all the data has already been flushed, and the fork has no visible effect.

2. When the program is redirected, that is, printing to a file, the flush strategy implicitly becomes full buffering.

The \n no longer triggers a flush.

By the time fork runs, the library functions have certainly finished executing, but their data has not been flushed. It is still in the C standard library buffer of the current process, so it is part of the parent process's data.

A function having finished does not mean its data has been flushed. After fork, both the parent and the child reach return 0 and flush their buffers on exit. The buffer is ordinary process memory, shared between parent and child after fork, so the first process to flush it (and thereby modify it) triggers copy-on-write. Each process ends up with its own copy of the same data and writes it to the file separately.

Therefore the output of the C standard library functions is printed twice, and the output of the system interface is printed once.

The system interface writes directly to the file (more precisely, hands the data to the kernel) without going through this buffer.

This confirms that the buffer is not provided by the operating system. It is provided by the C standard library. If it were provided by the operating system, the output of the system interface would also have appeared twice, not once.


So where exactly is it?

Before answering, let's add one more line of code, just before fork():

    fflush(stdout);

Run it again with the output redirected to log.txt:

This time every message appears only once instead of twice.

After the explanation above this is easy to understand: fflush forcibly flushes the contents of the buffer, so the buffer is already empty when fork executes. Both the parent's and the child's buffers are empty, there is nothing left to flush at exit, and each message is printed only once.

Note that fflush is provided by the C library and we pass it only one argument, stdout. So how does it find the buffer?

In C, the function for opening a file is fopen. Its prototype is:

FILE *fopen(const char *pathname, const char *mode);

It returns a FILE *. FILE is a structure defined by the C library, not by the kernel. In glibc it is a typedef for struct _IO_FILE, which not only encapsulates the fd but also contains the language-level buffer that belongs to that fd.

Roughly speaking, _IO_FILE contains a set of pointers marking the start, the end, and the current position of the read and write areas of the buffer, plus the file descriptor itself (the _fileno field).

So the C library's buffer lives inside the FILE structure, and a FILE * such as stdout is all fflush needs in order to find it.

Note that the kernel has buffers of its own. The kernel buffers and the user-level buffer (such as the one provided by the C library) are separate: flushing the user-level buffer only hands the data to the kernel, and the kernel decides on its own when to write it to the device.

Manually design a user-space buffer

We now know that the buffer is encapsulated in the FILE structure together with the file descriptor fd and other fields. So the first step is to define a structure that holds this data.

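#define _GNU_SOURCE //needed for syncfs(), must come before the includes
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define NUM 1024
struct MyFILE_{
 int fd;//file descriptor
 char buffer[NUM];//the buffer
 int end;//current end of the data in the buffer
};
typedef struct MyFILE_ MyFILE;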

To use it we need four interfaces: fopen_, fputs_, fflush_, and fclose_.

Let's start with fopen_. It takes two parameters: the first is the pathname of the file to open, and the second is the open mode (r, r+, w, w+, a, a+). It returns a pointer to a MyFILE structure.

We will not implement all six modes. Only the w mode is written, which is enough for our purposes.

The idea is: call the system open() interface with the flags O_WRONLY | O_TRUNC | O_CREAT and take the fd it returns.

If fd is greater than or equal to 0, the file was opened successfully. We then allocate space for a MyFILE structure, initialize it, and store the fd we just obtained in it.

MyFILE* fopen_(const char* pathname,const char* mode)
{
  assert(pathname);
  assert(mode);

  MyFILE* fp = NULL;
  if(strcmp(mode,"r") == 0)
  {
    //not implemented
  }
  else if(strcmp(mode,"r+") == 0)
  {
    //not implemented
  }
  else if(strcmp(mode,"w") == 0)
  {
    int fd = open(pathname,O_WRONLY | O_TRUNC | O_CREAT,0666);
    if(fd >= 0)
    {
      fp = (MyFILE*)malloc(sizeof(MyFILE));
      if(fp == NULL)
      {
        close(fd);//allocation failed: do not leak the descriptor
        return NULL;
      }
      memset(fp,0,sizeof(MyFILE));
      fp->fd = fd;
    }
  }
  else if(strcmp(mode,"w+") == 0)
  {
    //not implemented
  }
  else if(strcmp(mode,"a") == 0)
  {
    //not implemented
  }
  else if(strcmp(mode,"a+") == 0)
  {
    //not implemented
  }
  return fp;
}

Next is fputs_. Its job is to write content to the file identified by fd (in practice, it writes to the buffer). The first parameter is the content to write, and the second is the MyFILE structure.

The main idea is to copy the incoming content into the buffer inside MyFILE and then advance end by the length of the content.

After that we check whether fd is 0, 1, 2, or some other file. Here we only implement the case fd = 1.

When fd = 1, check whether the last character in the buffer is '\n'. If it is, write the buffer to fd 1, that is, flush the contents of the buffer, and reset end to 0.

If not, nothing needs to be done.

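void fputs_(const char* message, MyFILE* fp)
{
  assert(message);
  assert(fp);

  size_t len = strlen(message);
  assert(len < NUM);//a single message must fit in the buffer
  if(fp->end + len >= NUM)
  {
    //not enough room left: flush what is already buffered
    write(fp->fd,fp->buffer,fp->end);
    fp->end = 0;
  }
  strcpy(fp->buffer+fp->end,message);
  fp->end += len;

  //for debug
  //printf("%s\n",fp->buffer);
  //Nothing has been flushed yet. Who carries out the flush strategy?
  //The user does, by executing the code of the C standard library.
  //Where does the efficiency come from? Because C provides a buffer,
  //the strategy reduces the number of I/O operations, not the amount of data.
  if(fp->fd == 0)
  {
    //standard input
  }
  else if(fp->fd == 1)
  {
    //standard output: line buffered
    if(fp->end > 0 && fp->buffer[fp->end-1] == '\n')
    {
      //fprintf(stderr,"fflush: %s",fp->buffer);
      write(fp->fd,fp->buffer,fp->end);
      fp->end = 0;
    }
  }
  else if(fp->fd == 2)
  {
    //standard error
  }
  else
  {
    //other files: fully buffered, flushed by fflush_ or fclose_
  }
}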

Next is fflush_, whose job is to force a flush of the contents of the buffer.

The main idea: check whether the buffer is empty. If it is not, write its contents to the file descriptor fd, then call syncfs to ask the kernel to push the data to disk, and finally reset end to 0.

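void fflush_(MyFILE* fp)
{
  assert(fp);
  if(fp->end != 0)
  {
    //treat this as flushed for now; write() really only hands the data to the kernel
    write(fp->fd,fp->buffer,fp->end);
    //push the data to disk; syncfs() is Linux-specific and syncs the whole
    //filesystem, fsync(fp->fd) would sync only this file
    syncfs(fp->fd);
    fp->end = 0;
  }
}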

The last one is fclose_, which flushes the buffer and closes the file.

This one is simple: reuse fflush_ to flush the buffer, call close to close the file, and free the structure.

void fclose_(MyFILE* fp)
{
  assert(fp);
  fflush_(fp);
  close(fp->fd);
  free(fp);

}

That concludes the discussion of the buffer. If you have any questions or spot any mistakes, feel free to raise or correct them in the comment area or by private message.


Origin blog.csdn.net/weixin_47257473/article/details/131913698