Implementing the readline algorithm with streams

As usual, let's start with the principles.

Byte stream, character stream, object stream

A stream is a flow of data; all data transmission is a stream, whether within a platform or between platforms. But sometimes we need to split a whole piece of data into several chunks and process each chunk as it flows past, and for that we need the stream API.

  • For example, streaming media. From the browser's perspective, when we watch a video online we can start watching (and downloading) immediately, without waiting for the whole video to buffer.

  • For example, downloading large files. From the server's perspective, you can read a large file from the database and send it to the front end without first loading the entire file into memory: build a pipeline and let the file flow to the front end bit by bit, saving both time and memory.

A chunk is the smallest unit of a stream. Classified by chunk size, the three most commonly used kinds of stream are the byte stream, the character stream, and the object stream; as the names suggest, their smallest units are one byte, one character, and one (JS) object respectively. But today we will build a new kind: the paragraph stream.

In the computer world, a line is a paragraph and a paragraph is a line: a paragraph chunk is a string that contains no line breaks. A stream whose chunks are paragraphs is called a paragraph stream, or line stream.

A bit of trivia:

Dragging over text has three behaviors: press and drag to select character by character; double-click, hold, and drag to select word by word; triple-click, hold, and drag to select paragraph by paragraph. These mouse behaviors were defined last century, yet many people still don't know them.

Readline source code analysis

Because line lengths vary, many platforms do not provide a paragraph stream. Fortunately, Node.js does: the readline module built into the Node.js standard library is an interface for reading a readable stream line by line.

Reading line by line from memory and reading line by line from external storage are completely different things, because memory belongs to the computer while external storage is a peripheral device. From the kernel's perspective, reading a file from disk and reading one from the network are the same. Reading a line of string out of memory is trivial, but reading a line from external storage through the file system requires attention to time and space efficiency.

Readable stream, transform stream, writable stream

Classified by flow direction, there are three concepts: readable streams, transform streams, and writable streams. In order, data is generally read from a readable stream, passes through zero or more transform streams, and is finally written to a writable stream. readline is a kind of transform stream: it transforms the character stream written into it, assembling it into a paragraph stream that can be read out. The assembly process works as follows:

First, we prepare a buffer queue to temporarily hold strings. Each time the character stream hands us a string chunk, that chunk may contain several newline characters \n, so we call chunk.split('\n') to get a list of strings. Every string in the list except the last is complete and can be pushed out in order; but the last string may be unfinished — its length may grow when the next chunk arrives — so the last string stays in the queue for now.

When the next chunk comes in, it is handled the same way: all the complete strings are pushed out, and the last one is kept in the queue. Every chunk follows this method; after the final chunk ends, whatever strings remain in the queue are pushed out.
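Stripped of the stream machinery, this bookkeeping can be sketched as a pure function (feed is an illustrative name, not a Node API):

```javascript
// Feed one chunk into the queue; return the complete lines plus the
// leftover text after the last '\n', which may be continued later.
function feed(queue, chunk) {
  const lines = (queue + chunk).split('\n');
  const rest = lines.pop(); // possibly an unfinished line
  return [lines, rest];
}

// Simulate the text "abc\ndef\n" arriving in three chunks.
let queue = '';
let out = [];
for (const chunk of ['ab', 'c\nde', 'f\n']) {
  const [lines, rest] = feed(queue, chunk);
  out = out.concat(lines);
  queue = rest;
}
// End of stream: flush whatever is still queued.
if (queue !== '') out.push(queue);
console.log(out); // [ 'abc', 'def' ]
```

Note that 'abc' is only emitted once the second chunk delivers its terminating newline — no line ever crosses the queue boundary incomplete.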

With this algorithm, the paragraph stream can read an external file one line at a time. Most importantly, the memory it consumes is independent of the total file size; it depends only on the length of the longest line.

The readline algorithm seems simple enough. Why not hand-write a lineReader.js:

const Transform = require('stream').Transform;

module.exports = class extends Transform {
    constructor() {
        super({
            // the writable (input) side receives plain bytes/strings
            writableObjectMode: false,
            // the readable (output) side emits one line per chunk
            readableObjectMode: true,
        });
        this.queue = '';
    }

    _transform(chunk, encoding, next) {
        this.queue += chunk.toString();
        const lines = this.queue.split('\n');
        // the last piece may be an unfinished line; keep it queued
        this.queue = lines.pop();
        // push out every complete line
        lines.forEach(line => this.push(line));
        // pass chunks along with either this.push or next, not both
        next();
    }

    // called after the last chunk has been written
    _flush(callback) {
        this.queue.split('\n').forEach(line => {
            this.push(line);
        });
        callback();
    }
}

See? The whole lineReader extends the Transform class, overriding the _transform method to handle each chunk as it is written, and _flush to handle what remains after the last chunk. The whole process is very simple, and it is used the same way as any other transform stream — through pipe or by listening for 'data' events:

const fs = require('fs');
const LineReader = require('./lineReader.js');
fs.createReadStream('path/to/textFile.txt', { encoding: 'utf8' })
  .pipe(new LineReader())
  .on('data', line => {
    console.log('------new line------', line);
  });

The readline module of Node.js works on the same principle as our lineReader, except that it adds error handling and wraps some helper methods. In production it is better to use the readline module — it is the standard library, after all.

Markup language streams, functional code streams

The streaming media technology mentioned earlier serves not only images, audio, and video — it also works on web pages, surprisingly. Markup languages such as HTML and JSON can be rendered as they arrive (for streaming JSON, see ndjson). Source files of functional programming languages can also be streamed, because functional programs are composed of expressions. In theory a JS file could be compiled on the fly as an "expression stream", but the cursed "hoisting" mechanism destroys JavaScript's streamability, forcing the browser to wait for the entire js file to arrive before parsing begins.
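ndjson, for instance, simply puts one complete JSON document per line, so the same split-on-newline idea turns a text stream into an object stream. A minimal, non-streaming sketch of the per-line parsing (parseNdjsonLines is an illustrative name):

```javascript
// Each non-empty line of ndjson is one complete JSON document,
// so splitting on '\n' yields independently parseable records.
function parseNdjsonLines(text) {
  return text
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line));
}

console.log(parseNdjsonLines('{"id":1}\n{"id":2}\n')); // [ { id: 1 }, { id: 2 } ]
```

Piping a byte stream through the lineReader above and calling JSON.parse on each line gives the streaming version of the same idea.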

As every front-end developer knows, the JS files of a modern web page far outweigh its HTML. What good is instant HTML rendering in that environment? To produce long HTML, the back end has to use a template engine, which indirectly undermines front/back-end separation. The EcmaScript committee has long urged everyone to use let instead of var, and even advised against putting all code inside one closure (which makes expressions too large to stream). But what's the use? After all these years, nothing has changed. Blame the abundance of careless programmers and the scarcity of people who care about code performance.

(End)




Origin blog.csdn.net/github_38885296/article/details/103296266