Go source code learning: bufio package-1.2-scan.go-(2)

bufio package official documentation

Go source code learning-index directory

Previous article:
Go source code learning: bufio package-1.2-scan.go-(1)

10. Scan: scan the next token

This part of the code defines the Scan method, which advances the [Scanner] to the next token; the token is then available through the [Scanner.Bytes] or [Scanner.Text] method. Scan returns false when there are no more tokens, either because the end of the input was reached or because an error occurred. After Scan returns false, the [Scanner.Err] method returns any error that occurred during scanning, except that if the error was [io.EOF], [Scanner.Err] returns nil.

Scan panics if the split function returns too many empty tokens without advancing the input. This is a common error mode for scanners.

func (s *Scanner) Scan() bool {
    if s.done {
        return false
    }
    s.scanCalled = true

    // Loop until we have a token.
    for {
        // See if we can get a token with what we already have.
        // If we've run out of data but have an error, give the split function
        // a chance to recover any remaining, possibly empty token.
        if s.end > s.start || s.err != nil {
            advance, token, err := s.split(s.buf[s.start:s.end], s.err != nil)
            if err != nil {
                if err == ErrFinalToken {
                    s.token = token
                    s.done = true

                    // When token is not nil, the scanning stops with a trailing
                    // token, so the return value should be true to indicate
                    // the existence of the token.
                    return token != nil
                }
                s.setErr(err)
                return false
            }

            if !s.advance(advance) {
                return false
            }

            s.token = token

            if token != nil {
                if s.err == nil || advance > 0 {
                    s.empties = 0
                } else {
                    // Returning tokens not advancing input at EOF.
                    s.empties++

                    if s.empties > maxConsecutiveEmptyReads {
                        panic("bufio.Scan: too many empty tokens without progressing")
                    }
                }

                return true
            }
        }

        // We cannot generate a token with what we are holding.
        // If we've already hit EOF or an I/O error, we are done.
        if s.err != nil {
            // Shut it down.
            s.start = 0
            s.end = 0
            return false
        }

        // Must read more data.
        // First, shift data to the beginning of the buffer if there's lots of
        // empty space or space is needed.
        if s.start > 0 && (s.end == len(s.buf) || s.start > len(s.buf)/2) {
            copy(s.buf, s.buf[s.start:s.end])
            s.end -= s.start
            s.start = 0
        }

        // Is the buffer full? If so, resize.
        if s.end == len(s.buf) {
            // Guarantee no overflow in the multiplication below.
            const maxInt = int(^uint(0) >> 1)

            if len(s.buf) >= s.maxTokenSize || len(s.buf) > maxInt/2 {
                s.setErr(ErrTooLong)
                return false
            }

            newSize := len(s.buf) * 2

            if newSize == 0 {
                newSize = startBufSize
            }

            newSize = min(newSize, s.maxTokenSize)

            newBuf := make([]byte, newSize)
            copy(newBuf, s.buf[s.start:s.end])
            s.buf = newBuf
            s.end -= s.start
            s.start = 0
        }

        // Finally we can read some input. Make sure we don't get stuck with
        // a misbehaving Reader. Officially we don't need to do this, but let's
        // be extra careful: Scanner is for safe, simple jobs.
        for loop := 0; ; {
            n, err := s.r.Read(s.buf[s.end:len(s.buf)])

            if n < 0 || len(s.buf)-s.end < n {
                s.setErr(ErrBadReadCount)
                break
            }

            s.end += n

            if err != nil {
                s.setErr(err)
                break
            }

            if n > 0 {
                s.empties = 0
                break
            }

            loop++

            if loop > maxConsecutiveEmptyReads {
                s.setErr(io.ErrNoProgress)
                break
            }
        }
    }
}

Explanation:

  • Scan is a method of the Scanner struct, declared with the pointer receiver s *Scanner, through which it reads and updates the struct's fields.
  • First, if s.done is true, scanning has already finished and Scan returns false immediately.
  • Next, s.scanCalled is set to true to record that scanning has started.
  • Then it loops until it has a token.
    • First, it checks whether a token can be produced from the data already buffered. If the data is exhausted but an error is pending (including io.EOF), the split function gets one more chance to recover a remaining, possibly empty, token.
    • If the split function returns an error, it is handled by type. For ErrFinalToken, s.token is set to the returned token, s.done is set to true, and Scan returns token != nil: true when a trailing token is delivered, false when the token is nil. For any other error, the error is recorded with setErr and Scan returns false.
    • Otherwise the input is advanced (Scan returns false if the advance count is invalid). If the token is non-nil, the empty-token counter s.empties is reset or incremented as appropriate and Scan returns true.
  • If no token can be produced and s.err is already set (EOF or an I/O error), s.start and s.end are reset to zero and Scan returns false.
  • Otherwise more data must be read: buffered data is shifted to the front of the buffer when the buffer is full or s.start is past its midpoint, a full buffer is doubled in size (capped at s.maxTokenSize, failing with ErrTooLong once that limit is reached), and the Reader is read until it yields data, returns an error, or makes no progress for more than maxConsecutiveEmptyReads attempts (io.ErrNoProgress).

Purpose:

  • The Scan method scans for the next token and advances the [Scanner] past it.
  • It produces tokens from buffered data and reads new data when needed, handling error conditions along the way.
  • Its checks on the advance count, the buffer size, and the number of consecutive empty tokens and empty reads keep a faulty split function or Reader from stalling the scan.

11. advance: consume n bytes in the buffer

This part of the code defines the advance method, which consumes n bytes of the buffer. It reports whether the advance was legal.

// advance consumes n bytes of the buffer. It reports whether the advance was legal.
func (s *Scanner) advance(n int) bool {
    if n < 0 {
        s.setErr(ErrNegativeAdvance)
        return false
    }
    if n > s.end-s.start {
        s.setErr(ErrAdvanceTooFar)
        return false
    }
    s.start += n
    return true
}

Explanation:

  • advance is a method of the Scanner struct, declared with the pointer receiver s *Scanner, through which it reads and updates the struct's fields.
  • First, it checks whether n, the number of bytes to consume, is negative; if so, it records ErrNegativeAdvance and returns false.
  • Then it checks whether n exceeds the number of unread bytes in the buffer (s.end-s.start); if so, it records ErrAdvanceTooFar and returns false.
  • Finally, it moves s.start forward by n bytes and returns true to indicate that the advance was legal.

Purpose:

  • The advance method consumes n bytes of the buffer and reports whether doing so was legal.
  • It validates the advance count returned by the split function before updating the buffer's start position.
  • Because of these checks, a split function that returns a bad count produces a reported error rather than a corrupted buffer position.

12. setErr: record the first error encountered

This part of the code defines the setErr method, which records the first error encountered.

// setErr records the first error encountered.
func (s *Scanner) setErr(err error) {
    if s.err == nil || s.err == io.EOF {
        s.err = err
    }
}

Explanation:

  • setErr is a method of the Scanner struct, declared with the pointer receiver s *Scanner, through which it reads and updates the struct's fields.
  • It checks whether the current error s.err is nil or io.EOF; if so, it stores the new error in s.err. Any other error already stored is left untouched.

Purpose:

  • The setErr method records the first error encountered.
  • An existing real error is never overwritten, but io.EOF is: a later error, such as one returned by the split function at EOF, replaces it so that [Scanner.Err] can report it.

13. Buffer: Set the initial buffer and maximum allocated buffer size during scanning

This part of the code defines the Buffer method, which sets the initial buffer to use when scanning and the maximum size of buffer that may be allocated during scanning. The maximum token size must be less than the larger of max and cap(buf). If max <= cap(buf), [Scanner.Scan] will use this buffer only and do no allocation.

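// Buffer sets the initial buffer to use when scanning and the maximum size
// of buffer that may be allocated during scanning.
// The maximum token size must be less than the larger of max and cap(buf).
// If max <= cap(buf), [Scanner.Scan] will use this buffer only and do no allocation.
//
// By default, [Scanner.Scan] uses an internal buffer and sets the
// maximum token size to [MaxScanTokenSize].
//
// Buffer panics if it is called after scanning has started.
func (s *Scanner) Buffer(buf []byte, max int) {
    if s.scanCalled {
        panic("Buffer called after Scan")
    }
    s.buf = buf[0:cap(buf)]
    s.maxTokenSize = max
}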

Explanation:

  • Buffer is a method of the Scanner struct with two parameters: buf, the initial buffer, and max, the maximum buffer size that may be allocated.
  • If scanning has already started (s.scanCalled is true), calling Buffer panics.
  • It sets the Scanner's buf field to buf[0:cap(buf)], that is, the supplied slice extended to its full capacity, and sets the maxTokenSize field to max.

Purpose:

  • The Buffer method configures the buffer used during scanning and the maximum size to which it may grow.
  • If max <= cap(buf), Scan works entirely within the supplied buffer and performs no additional allocation.
  • Calling it after scanning has started panics, which forces configuration to happen before the first call to Scan.

14. Split: Set the split function of [Scanner]

This part of the code defines the Split method, which sets the split function for the [Scanner]. The default split function is [ScanLines].

// Split sets the split function for the [Scanner].
// The default split function is [ScanLines].
//
// Split panics if it is called after scanning has started.
func (s *Scanner) Split(split SplitFunc) {
    if s.scanCalled {
        panic("Split called after Scan")
    }
    s.split = split
}

Explanation:

  • Split is a method of the Scanner struct with one parameter, split, the split function to use.
  • If scanning has already started (s.scanCalled is true), calling Split panics.
  • It sets the Scanner's split field to the function passed in.

Purpose:

  • The Split method configures the split function used by the [Scanner].
  • By supplying a custom split function, users decide how the input is divided into tokens during scanning.
  • Calling it after scanning has started panics, which forces configuration to happen before the first call to Scan.

15. ScanBytes: split function for [Scanner]

This part of the code defines the ScanBytes function, a split function for [Scanner] that returns each byte as a token.

// ScanBytes is a split function for a [Scanner] that returns each byte as a token.
func ScanBytes(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    return 1, data[0:1], nil
}

Explanation:

  • ScanBytes is a split function matching the SplitFunc signature. It takes two parameters: data, the unprocessed input, and atEOF, which reports whether the end of the input has been reached.
  • If atEOF is true and data is empty, it returns 0, nil, nil to signal that there are no more tokens.
  • Otherwise it returns an advance of 1 and data[0:1], a slice holding the first byte, as the token.

Purpose:

  • ScanBytes defines how the [Scanner] cuts up the input during scanning: every byte is treated as an independent token.
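
Because ScanBytes works on bytes rather than characters, a multi-byte UTF-8 character is split across several tokens. A minimal sketch, with byteTokens as a hypothetical helper: the input "h\u00e9" ("h" followed by "é", whose UTF-8 encoding is two bytes long) yields three one-byte tokens.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// byteTokens returns every token produced by bufio.ScanBytes for input.
// Each token is exactly one byte long.
func byteTokens(input string) []string {
	sc := bufio.NewScanner(strings.NewReader(input))
	sc.Split(bufio.ScanBytes)
	var out []string
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	return out
}

func main() {
	tokens := byteTokens("h\u00e9")
	fmt.Println(len(tokens)) // prints: 3
	fmt.Printf("%q\n", tokens)
}
```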

16. errorRune: byte slice for the UTF-8 error rune

This part of the code defines the errorRune variable, a byte slice holding the UTF-8 encoding of the rune that marks a decoding error.

var errorRune = []byte(string(utf8.RuneError))

Explanation:

  • errorRune is a byte slice containing the UTF-8 encoding of the error rune.
  • utf8.RuneError is the Unicode replacement character U+FFFD; converting it to a string and then to []byte yields its three-byte encoding "\xef\xbf\xbd".

Purpose:

  • errorRune is the byte sequence returned as a token when UTF-8 decoding fails; it marks the parts of the input that could not be decoded correctly.

17. ScanRunes: Scan UTF-8 encoded runes

This code defines the ScanRunes function, a split function for [Scanner] that returns each UTF-8-encoded rune as a token. The sequence of runes returned is equivalent to that from a range loop over the input as a string, which means that erroneous UTF-8 encodings translate to U+FFFD = "\xef\xbf\xbd". Because of the Scan interface, this makes it impossible for the client to distinguish correctly encoded replacement runes from encoding errors.

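// ScanRunes is a split function for a [Scanner] that returns each UTF-8-encoded rune as a token.
// The sequence of runes returned is equivalent to that from a range loop over the input
// as a string, which means that erroneous UTF-8 encodings translate to U+FFFD = "\xef\xbf\xbd".
// Because of the Scan interface, this makes it impossible for the client to distinguish
// correctly encoded replacement runes from encoding errors.
func ScanRunes(data []byte, atEOF bool) (advance int, token []byte, err error) {
    // At EOF with no data left: nothing to return.
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }

    // Fast path 1: ASCII.
    if data[0] < utf8.RuneSelf {
        return 1, data[0:1], nil
    }

    // Fast path 2: Correct UTF-8 decode without error.
    _, width := utf8.DecodeRune(data)
    if width > 1 {
        // It's a valid encoding. Width cannot be one for a correctly encoded
        // non-ASCII rune.
        return width, data[0:width], nil
    }

    // We know it's an error: we have width==1 and implicitly r==utf8.RuneError.
    // Is the error because there wasn't a full rune to be decoded?
    // FullRune distinguishes correctly between erroneous and incomplete encodings.
    if !atEOF && !utf8.FullRune(data) {
        // Incomplete; get more bytes.
        return 0, nil, nil
    }

    // We have a real UTF-8 encoding error. Return a properly encoded error rune
    // but advance only one byte. This matches the behavior of a range loop over
    // an incorrectly encoded string.
    return 1, errorRune, nil
}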

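Explanation:

  • ScanRunes is a split function for [Scanner] that splits the data into UTF-8-encoded runes.
  • First, if atEOF is true and the data is empty, it returns 0, nil, nil.
  • Then, fast path 1: if the first byte is ASCII (less than utf8.RuneSelf), it returns an advance of 1 and that single byte as the token.
  • Next, fast path 2: it decodes the first rune with utf8.DecodeRune; a width greater than 1 means a valid multi-byte encoding, so it returns the width and those bytes as the token.
  • Otherwise the width is 1 and the decoded rune is utf8.RuneError. If the input is not at EOF and utf8.FullRune reports that the encoding is merely incomplete, it returns 0, nil, nil to request more bytes.
  • If it is a real encoding error, it advances by one byte and returns errorRune (U+FFFD) as the token, matching the behavior of a range loop over an incorrectly encoded string.
  • In every case the function returns advance (the number of bytes to consume), token (the bytes of the rune) and err, which is always nil here.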

Purpose:

  • The ScanRunes function splits UTF-8-encoded data into runes for the [Scanner].
  • It handles ASCII, multi-byte runes, incomplete input and encoding errors, so every token it returns is valid UTF-8.

18. dropCR: remove the \r at the end

This code defines the dropCR function, which removes a trailing carriage return (\r) from the data.

// dropCR drops a terminal \r from the data.
func dropCR(data []byte) []byte {
    // If the data is non-empty and ends in \r, return it without that byte.
    if len(data) > 0 && data[len(data)-1] == '\r' {
        return data[0 : len(data)-1]
    }
    // Otherwise return the data unchanged.
    return data
}

Explanation:

  • dropCR is a plain function, not a method. It takes a byte slice, data, and removes a carriage return (\r) from its end.
  • First, it checks whether the data is non-empty and ends in \r; if so, it returns the data without that last byte.
  • Otherwise it returns the original data.

Purpose:

  • The dropCR function strips a single trailing carriage return (\r), if present, so that lines ending in \r\n and lines ending in \n yield the same token.
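
dropCR is unexported, but its effect is visible through bufio.ScanLines, which can be called directly like any other function. The sketch below assumes a hypothetical helper, firstLine, and shows that only a \r at the very end of the line is dropped.

```go
package main

import (
	"bufio"
	"fmt"
)

// firstLine calls bufio.ScanLines directly and returns the advance count
// and the first token as a string.
func firstLine(data string, atEOF bool) (int, string) {
	advance, token, _ := bufio.ScanLines([]byte(data), atEOF)
	return advance, string(token)
}

func main() {
	n, s := firstLine("abc\r\nrest", false)
	fmt.Printf("%d %q\n", n, s) // prints: 5 "abc"

	// A \r in the middle of the line is kept.
	n, s = firstLine("a\rb\n", false)
	fmt.Printf("%d %q\n", n, s) // prints: 4 "a\rb"

	// At EOF, a final unterminated line also loses its trailing \r.
	n, s = firstLine("tail\r", true)
	fmt.Printf("%d %q\n", n, s) // prints: 5 "tail"
}
```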

19. ScanLines: Scan lines of text

This code defines the ScanLines function, a split function for [Scanner] that returns each line of text, stripped of any trailing end-of-line marker. The returned line may be empty. The end-of-line marker is one optional carriage return followed by one mandatory newline; in regular expression notation, it is \r?\n. The last non-empty line of input is returned even if it has no newline.

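// ScanLines is a split function for a [Scanner] that returns each line of
// text, stripped of any trailing end-of-line marker. The returned line may
// be empty. The end-of-line marker is one optional carriage return followed
// by one mandatory newline. In regular expression notation, it is `\r?\n`.
// The last non-empty line of input will be returned even if it has no newline.
func ScanLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
    // At EOF with no data left: nothing to return.
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    // Look for a newline in the data.
    if i := bytes.IndexByte(data, '\n'); i >= 0 {
        // We have a full newline-terminated line.
        return i + 1, dropCR(data[0:i]), nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), dropCR(data), nil
    }
    // Request more data.
    return 0, nil, nil
}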

Explanation:

  • ScanLines is a split function for [Scanner] that returns the text line by line, stripping the trailing end-of-line marker.
  • First, if atEOF is true and the data is empty, it returns 0, nil, nil.
  • It then searches the data for a newline; if one is found at index i, it returns an advance of i+1 and the line before the newline, with any trailing \r dropped.
  • If there is no newline but atEOF is true, the remaining data is a final, non-terminated line, and it is returned.
  • Otherwise it returns 0, nil, nil to request more data.

Purpose:

  • The ScanLines function splits text into lines and removes the end-of-line marker from each one.
  • It handles complete newline-terminated lines, including empty ones, as well as a final non-terminated line at EOF.

Origin blog.csdn.net/weixin_49015143/article/details/135285205