Kafka Series 3: Log Analysis

Log is an abstraction over multiple segments. Callers only need to operate on the Log; they never have to decide which segment to read from or write to, because the Log handles those details internally.

Let's look at a few of Log's core fields and functions:

    @volatile private var nextOffsetMetadata: LogOffsetMetadata = _

    private val segments: ConcurrentNavigableMap[java.lang.Long, LogSegment] = new ConcurrentSkipListMap[java.lang.Long, LogSegment]

    val topicPartition: TopicPartition = Log.parseTopicPartitionName(dir)

    private val tags = Map("topic" -> topicPartition.topic, "partition" -> topicPartition.partition.toString)
    @volatile var dir: File

    def logEndOffset: Long = nextOffsetMetadata.messageOffset
  • segments: the collection of segment files, held in a ConcurrentSkipListMap (a concurrent skip list; its internals are not analyzed here). The key is the baseOffset described in part 2 (e.g. 000.log/index has baseOffset=0, 238.log/index has baseOffset=238); the value is the LogSegment (analyzed later).
  • nextOffsetMetadata: records the next message offset, the active segment's baseOffset, and the size of that segment's log.
  • dir: the File object for this partition's directory on disk, i.e. xxx/topic-partition/.
  • logEndOffset: the messageOffset field of nextOffsetMetadata, i.e. the offset of the next message to be written, which can also be read as the current last offset.
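To make the segment lookup concrete, here is a minimal Java sketch (illustration only, not Kafka's code; the class name and file names are invented). A skip-list map keyed by baseOffset lets a floor lookup find the segment that contains a given offset:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: a Log holds its segments keyed by baseOffset.
// floorEntry(offset) returns the entry with the largest key <= offset,
// i.e. the segment that contains that offset.
public class SegmentMapSketch {
    static final ConcurrentSkipListMap<Long, String> segments = new ConcurrentSkipListMap<>();

    static {
        segments.put(0L, "00000000000000000000.log");
        segments.put(238L, "00000000000000000238.log");
    }

    public static long segmentFor(long offset) {
        return segments.floorEntry(offset).getKey();
    }

    public static void main(String[] args) {
        System.out.println(segmentFor(100)); // 0   (offset 100 lives in the first segment)
        System.out.println(segmentFor(300)); // 238 (offset 300 lives in the second segment)
    }
}
```

The same floorEntry call is what the read path uses to pick the starting segment.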

Initializing segments

for (file <- dir.listFiles if file.isFile) {
    val filename = file.getName
    if (filename.endsWith(IndexFileSuffix) || filename.endsWith(TimeIndexFileSuffix)) {
        // segment index file initialization ....
    } else if (filename.endsWith(LogFileSuffix)) {
        // segment log file initialization ....
        segments.put(start, segment)
    }
}

On startup, the files under the partition directory are traversed to populate segments.
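The scan can be sketched in Java as follows (a hypothetical rendering of the loop above; the suffix constants mirror Kafka's file naming, everything else is invented):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the startup scan: classify each file in the
// partition directory by suffix and register .log files by baseOffset.
public class LoadSketch {
    static final String LOG_SUFFIX = ".log";
    static final String INDEX_SUFFIX = ".index";
    static final String TIME_INDEX_SUFFIX = ".timeindex";

    public static List<Long> baseOffsets(List<String> filenames) {
        List<Long> result = new ArrayList<>();
        for (String name : filenames) {
            if (name.endsWith(INDEX_SUFFIX) || name.endsWith(TIME_INDEX_SUFFIX)) {
                // index files: only sanity-checked against their log file
            } else if (name.endsWith(LOG_SUFFIX)) {
                // strip the suffix and parse the zero-padded base offset
                result.add(Long.parseLong(name.substring(0, name.length() - LOG_SUFFIX.length())));
            }
        }
        return result;
    }
}
```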

Index initialization

    val logFile =
        if (filename.endsWith(TimeIndexFileSuffix))
            new File(file.getAbsolutePath.replace(TimeIndexFileSuffix, LogFileSuffix))
        else
            new File(file.getAbsolutePath.replace(IndexFileSuffix, LogFileSuffix))

    if (!logFile.exists) {
        warn("Found an orphaned index file, %s, with no corresponding log file.".format(file.getAbsolutePath))
        file.delete()
    }

Not much happens here; the main purpose is to ensure the log file corresponding to each index exists (orphaned index files are deleted).

Log file initialization

    // extract the baseOffset, e.g. 238 for 238.log
    val start = filename.substring(0, filename.length - LogFileSuffix.length).toLong
    val indexFile = Log.indexFilename(dir, start)
    val timeIndexFile = Log.timeIndexFilename(dir, start)

    val indexFileExists = indexFile.exists()
    val timeIndexFileExists = timeIndexFile.exists()
    // construct the LogSegment
    val segment = new LogSegment(dir = dir,startOffset = start,
        indexIntervalBytes = config.indexInterval,
        maxIndexSize = config.maxIndexSize,
        rollJitterMs = config.randomSegmentJitter,
        time = time, fileAlreadyExists = true)

    if (indexFileExists) {
        try { // sanity-check the index files
            segment.index.sanityCheck()
            if (!timeIndexFileExists)
                segment.timeIndex.resize(0)
            segment.timeIndex.sanityCheck()
        } catch {
            //....
        }
    } else {
        error("Could not find index file corresponding to log file %s, rebuilding index...".format(segment.log.file.getAbsolutePath))
        segment.recover(config.maxMessageSize)
    }
    segments.put(start, segment)

start is the baseOffset mentioned above:

e.g. 000.log/index gives baseOffset=0, and 238.log/index gives baseOffset=238.
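Concretely, Kafka zero-pads the base offset to 20 digits in the file name, so 238 becomes 00000000000000000238.log. A small sketch of the naming round trip (the helper names are made up):

```java
// Kafka names segment files by zero-padding the base offset to 20 digits;
// parsing it back is just stripping the suffix and calling Long.parseLong.
public class FileNameSketch {
    public static String logFileName(long baseOffset) {
        return String.format("%020d", baseOffset) + ".log";
    }

    public static long baseOffsetOf(String logFileName) {
        return Long.parseLong(logFileName.substring(0, logFileName.length() - ".log".length()));
    }
}
```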

Initializing nextOffsetMetadata

    locally {
        val startMs = time.milliseconds
        loadSegments()
        // def activeSegment = segments.lastEntry.getValue
        // i.e. the last element of segments: the segment currently in use
        // LogSegment is analyzed later; nextOffsetMetadata records the next message offset, the active segment's baseOffset, and that segment's log size
        nextOffsetMetadata = new LogOffsetMetadata(activeSegment.nextOffset, activeSegment.baseOffset, activeSegment.size)
        leaderEpochCache.clearAndFlushLatest(nextOffsetMetadata.messageOffset)
        logStartOffset = math.max(logStartOffset, segments.firstEntry().getValue.baseOffset)
        leaderEpochCache.clearAndFlushEarliest(logStartOffset)
        loadProducerState(logEndOffset, reloadFromCleanShutdown = hasCleanShutdownFile)
    }
    //
    private def loadSegments() {
        ....
        loadSegmentFiles() // the segments initialization described above
        ....
    }
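For reference, the three values nextOffsetMetadata bundles can be modeled as a small Java record (the field names mirror LogOffsetMetadata; the relativeOffset helper is added here purely for illustration):

```java
// Sketch of the three values nextOffsetMetadata carries: the next message
// offset, the active segment's base offset, and the byte position within
// that segment (its size at the time the metadata was captured).
public record OffsetMetadataSketch(long messageOffset, long segmentBaseOffset, int relativePositionInSegment) {
    // hypothetical helper: the offset expressed relative to the segment's base
    public long relativeOffset() {
        return messageOffset - segmentBaseOffset;
    }
}
```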

Appending messages

private def append(records: MemoryRecords, isFromClient: Boolean, assignOffsets: Boolean, leaderEpoch: Int): LogAppendInfo = {
        // analyze and validate this batch of messages; returns a summary
        val appendInfo = analyzeAndValidateRecords(records, isFromClient = isFromClient)

        ....

        try {
            lock synchronized {
                if (assignOffsets) {
                    // take the offset from nextOffsetMetadata; as seen during initialization, it holds the next message offset of the activeSegment
                    val offset = new LongRef(nextOffsetMetadata.messageOffset)
                    appendInfo.firstOffset = offset.value // the batch's first offset comes from nextOffsetMetadata, i.e. the log's current next offset
                    val now = time.milliseconds
                    val validateAndOffsetAssignResult = try {
                        LogValidator.validateMessagesAndAssignOffsets(validRecords,
                            offset,
                            now,
                            appendInfo.sourceCodec,
                            appendInfo.targetCodec,
                            config.compact,
                            config.messageFormatVersion.messageFormatVersion,
                            config.messageTimestampType,
                            config.messageTimestampDifferenceMaxMs,
                            leaderEpoch,
                            isFromClient)
                    } catch {
                    ....
                    }
                    // record the validation results
                    validRecords = validateAndOffsetAssignResult.validatedRecords
                    appendInfo.maxTimestamp = validateAndOffsetAssignResult.maxTimestamp
                    appendInfo.offsetOfMaxTimestamp = validateAndOffsetAssignResult.shallowOffsetOfMaxTimestamp
                    appendInfo.lastOffset = offset.value - 1
                    if (config.messageTimestampType == TimestampType.LOG_APPEND_TIME)
                        appendInfo.logAppendTime = now // record the log append time

                    ....
                } else {
                    // we are taking the offsets we are given
                    if (!appendInfo.offsetsMonotonic || appendInfo.firstOffset < nextOffsetMetadata.messageOffset)
                        throw new IllegalArgumentException("Out of order offsets found in " + records.records.asScala.map(_.offset))
                }

                ....

                // check whether the active segment is full; if so, roll and write to a new segment
                val segment = maybeRoll(messagesSize = validRecords.sizeInBytes,
                    maxTimestampInMessages = appendInfo.maxTimestamp,
                    maxOffsetInMessages = appendInfo.lastOffset)

                val logOffsetMetadata = LogOffsetMetadata(
                    messageOffset = appendInfo.firstOffset,
                    segmentBaseOffset = segment.baseOffset,
                    relativePositionInSegment = segment.size)

                // delegate the append to the segment, passing the batch's first and last offsets,
                // the max timestamp (and its offset), and the records themselves
                segment.append(firstOffset = appendInfo.firstOffset,
                    largestOffset = appendInfo.lastOffset,
                    largestTimestamp = appendInfo.maxTimestamp,
                    shallowOffsetOfMaxTimestamp = appendInfo.offsetOfMaxTimestamp,
                    records = validRecords)

                ....

                // nextOffsetMetadata = new LogOffsetMetadata(messageOffset, activeSegment.baseOffset, activeSegment.size)
                // update nextOffsetMetadata, which records the offset of the next message to be written
                updateLogEndOffset(appendInfo.lastOffset + 1)

                // if the number of unflushed messages exceeds the configured threshold, flush to disk (via each segment's flush)
                if (unflushedMessages >= config.flushInterval)
                    flush()

                appendInfo
            }
        } catch {
            ....
        }
    }

analyzeAndValidateRecords

The records are first analyzed and validated; let's focus on the bookkeeping part.

    private def analyzeAndValidateRecords(records: MemoryRecords, isFromClient: Boolean): LogAppendInfo = {
        var shallowMessageCount = 0
        var validBytesCount = 0
        var firstOffset = -1L
        var lastOffset = -1L
        var sourceCodec: CompressionCodec = NoCompressionCodec
        var maxTimestamp = RecordBatch.NO_TIMESTAMP
        var offsetOfMaxTimestamp = -1L

        for (batch <- records.batches.asScala) {
            ....

            lastOffset = batch.lastOffset

            val batchSize = batch.sizeInBytes

            batch.ensureValid()

            if (batch.maxTimestamp > maxTimestamp) {
                maxTimestamp = batch.maxTimestamp
                offsetOfMaxTimestamp = lastOffset
            }

            shallowMessageCount += 1
            validBytesCount += batchSize

            ....
        }
        ....
        LogAppendInfo(firstOffset, lastOffset, maxTimestamp, offsetOfMaxTimestamp, RecordBatch.NO_TIMESTAMP, sourceCodec,
            targetCodec, shallowMessageCount, validBytesCount, monotonic)
    }

The core bookkeeping above computes:

  • the largest (last) offset in the batch
  • the largest timestamp in the batch, and the offset at which it occurred
  • the shallow message (batch) count
  • the total valid bytes

Once done, a LogAppendInfo object is returned.
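The per-batch bookkeeping can be sketched in Java as follows (an illustrative stand-in for analyzeAndValidateRecords, not the real code; the parameter names are invented):

```java
// Sketch of the per-batch bookkeeping: track the last offset, the max
// timestamp (and the offset it occurred at), the batch count, and the
// total valid bytes.
public class AppendInfoSketch {
    public long lastOffset = -1L, maxTimestamp = Long.MIN_VALUE, offsetOfMaxTimestamp = -1L;
    public int batchCount = 0, validBytes = 0;

    // batchLastOffset/batchMaxTimestamp/batchSize stand in for the fields
    // read off each RecordBatch in the real code
    public void accept(long batchLastOffset, long batchMaxTimestamp, int batchSize) {
        lastOffset = batchLastOffset;
        if (batchMaxTimestamp > maxTimestamp) {
            maxTimestamp = batchMaxTimestamp;
            offsetOfMaxTimestamp = batchLastOffset;
        }
        batchCount++;
        validBytes += batchSize;
    }
}
```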

maybeRoll

    private def maybeRoll(messagesSize: Int, maxTimestampInMessages: Long, maxOffsetInMessages: Long): LogSegment = {
        val segment = activeSegment
        val now = time.milliseconds
        val reachedRollMs = segment.timeWaitedForRoll(now, maxTimestampInMessages) > config.segmentMs - segment.rollJitterMs
        if (segment.size > config.segmentSize - messagesSize ||
            (segment.size > 0 && reachedRollMs) ||
            segment.index.isFull || segment.timeIndex.isFull || !segment.canConvertToRelativeOffset(maxOffsetInMessages)) {
            // why maxOffsetInMessages - Integer.MAX_VALUE?
            roll(maxOffsetInMessages - Integer.MAX_VALUE)
        } else {
            segment
        }
    }

Several conditions decide whether to roll to a new segment:

  • the segment's log has no room left for this batch
  • the segment has been open longer than the configured roll time (segment.size > 0 && reachedRollMs)
  • the offset index file is full
  • the timeIndex file is full
  • the difference between the batch's max offset and the segment's base offset exceeds Int.MaxValue

If none of these hold, the current activeSegment is returned as-is; otherwise roll does the actual work.
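The decision can be sketched as a single Java predicate (illustrative only; the parameters stand in for the segment and config state the real code reads):

```java
// Sketch of the maybeRoll decision: a new segment is needed when the current
// one cannot take this batch, either index is full, the segment has been open
// past the configured roll time, or the batch's max offset no longer fits as
// a 4-byte relative offset.
public class RollDecisionSketch {
    public static boolean shouldRoll(int segmentBytes, int messagesBytes, int maxSegmentBytes,
                                     boolean offsetIndexFull, boolean timeIndexFull,
                                     boolean reachedRollMs, long maxOffset, long baseOffset) {
        boolean relativeOffsetOverflows = maxOffset - baseOffset > Integer.MAX_VALUE;
        return segmentBytes > maxSegmentBytes - messagesBytes
                || (segmentBytes > 0 && reachedRollMs)
                || offsetIndexFull || timeIndexFull || relativeOffsetOverflows;
    }
}
```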

The implementation:

roll

    def roll(expectedNextOffset: Long = 0): LogSegment = {
        val start = time.nanoseconds
        lock synchronized {
            val newOffset = math.max(expectedNextOffset, logEndOffset)
            // resolve the new segment's files; delete any that already exist
            val logFile = Log.logFile(dir, newOffset)
            val offsetIdxFile = offsetIndexFile(dir, newOffset)
            val timeIdxFile = timeIndexFile(dir, newOffset)
            val txnIdxFile = transactionIndexFile(dir, newOffset)
            for (file <- List(logFile, offsetIdxFile, timeIdxFile, txnIdxFile) if file.exists) {
                file.delete()
            }

            segments.lastEntry() match {
                case null =>
                case entry => {
                    val seg = entry.getValue
                    seg.onBecomeInactiveSegment() // write the largest timestamp and offset into the time index
                    seg.index.trimToValidSize() // trim the offset index file to its actual size
                    seg.timeIndex.trimToValidSize() // trim the time index file to its actual size
                    seg.log.trim() // trim the log file
                }
            }

            ....
            // construct a new segment from the new offset and the config
            val segment = new LogSegment(dir,
                startOffset = newOffset,
                indexIntervalBytes = config.indexInterval,
                maxIndexSize = config.maxIndexSize,
                rollJitterMs = config.randomSegmentJitter,
                time = time,
                fileAlreadyExists = false,
                initFileSize = initFileSize,
                preallocate = config.preallocate)
            val prev = addSegment(segment) // add it to the segments map
            ....
            // refresh nextOffsetMetadata
            updateLogEndOffset(nextOffsetMetadata.messageOffset)
            ....
            segment
        }
    }

In the earlier example, 238 is this logEndOffset (the expectedNextOffset). As for the question in maybeRoll: passing maxOffsetInMessages - Integer.MAX_VALUE ensures that newOffset = max(expectedNextOffset, logEndOffset) yields a base offset from which every offset in this batch is still representable as a 4-byte relative offset in the index.
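The constraint behind the last roll condition can be sketched in Java (illustrative; Kafka's canConvertToRelativeOffset performs essentially this check):

```java
// Sketch of the relative-offset constraint: the offset index stores each
// offset as a 4-byte delta from the segment's baseOffset, so an offset can
// only live in a segment if (offset - baseOffset) fits in an int.
public class RelativeOffsetSketch {
    public static boolean fitsInSegment(long offset, long baseOffset) {
        long delta = offset - baseOffset;
        return delta >= 0 && delta <= Integer.MAX_VALUE;
    }
}
```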

Reading messages

Reads are served by the read function.

    def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None, minOneMessage: Boolean = false,
             isolationLevel: IsolationLevel): FetchDataInfo = {

        val currentNextOffsetMetadata = nextOffsetMetadata
        val next = currentNextOffsetMetadata.messageOffset
        if (startOffset == next) {
            val abortedTransactions =
                if (isolationLevel == IsolationLevel.READ_COMMITTED) Some(List.empty[AbortedTransaction])
                else None
            return FetchDataInfo(currentNextOffsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false,
                abortedTransactions = abortedTransactions)
        }
        // floorEntry returns the entry whose key is the greatest key <= startOffset
        var segmentEntry = segments.floorEntry(startOffset)

        while (segmentEntry != null) {
            val segment = segmentEntry.getValue

            // the max read position has two cases: 1. reading from the active segment 2. reading from an older segment
            // 1. nextOffsetMetadata.relativePositionInSegment
            // 2. the segment's size
            val maxPosition = {
                if (segmentEntry == segments.lastEntry) { // reading from the active segment
                    // nextOffsetMetadata stores the segment's size in this field, so it is the max position
                    val exposedPos = nextOffsetMetadata.relativePositionInSegment.toLong
                    // re-check in case a roll just happened; in that case the segment's size is the max position
                    if (segmentEntry != segments.lastEntry)
                        segment.size
                    else
                        exposedPos
                } else {
                    segment.size
                }
            }
            // delegate the read to the segment
            val fetchInfo = segment.read(startOffset, maxOffset, maxLength, maxPosition, minOneMessage)
            if (fetchInfo == null) {
                segmentEntry = segments.higherEntry(segmentEntry.getKey)
            } else {
                return isolationLevel match {
                    case IsolationLevel.READ_UNCOMMITTED => fetchInfo
                    case IsolationLevel.READ_COMMITTED => addAbortedTransactions(startOffset, segmentEntry, fetchInfo)
                }
            }
        }

        FetchDataInfo(nextOffsetMetadata, MemoryRecords.EMPTY)
    }

Flushing messages to disk

When a message is appended it has not actually reached the disk yet; the write first lands in the OS page cache (RocketMQ behaves similarly) and must then be flushed. In Log, flushing is done by the flush function: given an offset, every message before that offset is flushed to disk.

    def flush(): Unit = flush(this.logEndOffset)
    def flush(offset: Long): Unit = {
        if (offset <= this.recoveryPoint)
            return
        // delegate the actual flushing to each segment
        for (segment <- logSegments(this.recoveryPoint, offset))
            segment.flush()

        producerStateManager.deleteSnapshotsBefore(minSnapshotOffsetToRetain(offset))

        lock synchronized {
            if (offset > this.recoveryPoint) {
                this.recoveryPoint = offset
                lastflushedTime.set(time.milliseconds)
            }
        }
    }

recoveryPoint: the first offset that has not yet been flushed; in other words, every offset below recoveryPoint is already on disk. That is why the function returns immediately when the given offset is <= recoveryPoint: those messages have already been flushed.
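The recoveryPoint bookkeeping can be sketched in Java (hypothetical; the real flush also iterates the affected segments and takes a lock, omitted here):

```java
// Sketch of the flush bookkeeping: offsets below recoveryPoint are already
// on disk, so a flush up to `offset` only does work (and advances the
// recovery point) when offset > recoveryPoint.
public class FlushSketch {
    private long recoveryPoint = 0L;

    // returns true if a real flush happened
    public boolean flushUpTo(long offset) {
        if (offset <= recoveryPoint)
            return false; // everything below offset is already flushed
        // ... flush the segments covering [recoveryPoint, offset) here ...
        recoveryPoint = offset;
        return true;
    }

    public long recoveryPoint() {
        return recoveryPoint;
    }
}
```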

As introduced at the beginning, Log is the externally facing abstraction that manages many segments, so when a caller invokes Log's flush, the flushing inevitably happens segment by segment. The segments in range are located first, using logSegments.

    def logSegments(from: Long, to: Long): Iterable[LogSegment] = {
        lock synchronized {
            val floor = segments.floorKey(from)
            if (floor eq null)
                segments.headMap(to).values.asScala
            else
                segments.subMap(floor, true, to, false).values.asScala
        }
    }

This function returns the segments whose offsets span from `from` to `to - 1`, or up to the end of the log if to > logEndOffset. It relies on ConcurrentSkipListMap's range views, i.e. the set of entries between two keys.
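A worked Java example of the range lookup, reusing the baseOffsets from earlier plus a hypothetical third segment at 500:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Worked example of the logSegments range lookup: floorKey(from) widens the
// range down to the segment containing `from`, then subMap takes every
// segment whose baseOffset is below `to` (exclusive upper bound).
public class RangeSketch {
    static final ConcurrentSkipListMap<Long, String> segments = new ConcurrentSkipListMap<>();

    static {
        segments.put(0L, "seg0");
        segments.put(238L, "seg238");
        segments.put(500L, "seg500");
    }

    public static List<Long> segmentBasesBetween(long from, long to) {
        Long floor = segments.floorKey(from);
        if (floor == null)
            return new ArrayList<>(segments.headMap(to).keySet());
        return new ArrayList<>(segments.subMap(floor, true, to, false).keySet());
    }
}
```

So a call covering offsets 100 to 299 returns the segments at 0 and 238: the floor widens the range so the segment containing offset 100 is included, while the exclusive upper bound keeps out segments starting at or after 300.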

That covers the functions behind Log's main operations; if you are interested in the rest, go read the source yourself.

A small gripe: this article took several days to finish =_=


Reposted from blog.csdn.net/u013160932/article/details/79721699