Log is an abstraction over segments: it wraps multiple segments so that callers only operate on the Log, without having to know which segment to read from or write to; the Log handles those details internally.
Let's look at a few of Log's core fields and methods:
@volatile private var nextOffsetMetadata: LogOffsetMetadata = _
private val segments: ConcurrentNavigableMap[java.lang.Long, LogSegment] = new ConcurrentSkipListMap[java.lang.Long, LogSegment]
val topicPartition: TopicPartition = Log.parseTopicPartitionName(dir)
private val tags = Map("topic" -> topicPartition.topic, "partition" -> topicPartition.partition.toString)
@volatile var dir: File
def logEndOffset: Long = nextOffsetMetadata.messageOffset
- segments: the collection of segments, held in a ConcurrentSkipListMap (the data structure itself is not analyzed here; look it up separately if interested). The key is the baseOffset described in part 2 of this series: for 000.log/index the baseOffset is 0, for 238.log/index it is 238. The value is the LogSegment (analyzed later); see the sketch after this list for why this map type is a good fit.
- nextOffsetMetadata: records the offset of the next message, the active segment's baseOffset, and the current size of the active segment's log.
- dir: the File object for this partition's directory on disk, i.e. xxx/topic-partition/
- logEndOffset: the messageOffset field of nextOffsetMetadata, i.e. the offset the next appended message will receive (one past the last offset currently in the log).
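To see why this map type fits, here is a minimal sketch (plain JDK collections, not Kafka code): floorEntry finds the segment containing any given offset in one lookup.
import java.util.concurrent.ConcurrentSkipListMap

// Minimal sketch (not Kafka code): locating the segment that holds an offset.
object SegmentLookupDemo extends App {
  val segments = new ConcurrentSkipListMap[java.lang.Long, String]
  segments.put(0L, "000.log")   // baseOffset 0
  segments.put(238L, "238.log") // baseOffset 238

  // The segment holding offset N is the one with the greatest baseOffset <= N.
  println(segments.floorEntry(100L).getValue) // 000.log
  println(segments.floorEntry(300L).getValue) // 238.log
}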
Initializing segments
for (file <- dir.listFiles if file.isFile) {
val filename = file.getName
if (filename.endsWith(IndexFileSuffix) || filename.endsWith(TimeIndexFileSuffix)) {
//segment index file initialization ....
} else if (filename.endsWith(LogFileSuffix)) {
//segment log file initialization ....
segments.put(start, segment)
}
}
On startup, the files in the partition directory are traversed to populate segments.
Index initialization
val logFile =
if (filename.endsWith(TimeIndexFileSuffix))
new File(file.getAbsolutePath.replace(TimeIndexFileSuffix, LogFileSuffix))
else
new File(file.getAbsolutePath.replace(IndexFileSuffix, LogFileSuffix))
if (!logFile.exists) {
warn("Found an orphaned index file, %s, with no corresponding log file.".format(file.getAbsolutePath))
file.delete()
}
Not much happens here; it mainly ensures that the log file corresponding to each index exists, deleting orphaned index files when it does not.
Log file initialization
// take the baseOffset, e.g. 238 for 238.log
val start = filename.substring(0, filename.length - LogFileSuffix.length).toLong
val indexFile = Log.indexFilename(dir, start)
val timeIndexFile = Log.timeIndexFilename(dir, start)
val indexFileExists = indexFile.exists()
val timeIndexFileExists = timeIndexFile.exists()
// construct the Segment
val segment = new LogSegment(dir = dir, startOffset = start,
indexIntervalBytes = config.indexInterval,
maxIndexSize = config.maxIndexSize,
rollJitterMs = config.randomSegmentJitter,
time = time, fileAlreadyExists = true)
if (indexFileExists) {
try { // sanity-check the index files
segment.index.sanityCheck()
if (!timeIndexFileExists)
segment.timeIndex.resize(0)
segment.timeIndex.sanityCheck()
} catch {
//....
}
} else {
error("Could not find index file corresponding to log file %s, rebuilding index...".format(segment.log.file.getAbsolutePath))
segment.recover(config.maxMessageSize)
}
segments.put(start, segment)
start is the baseOffset mentioned above:
for example, 000.log/index has baseOffset = 0 and 238.log/index has baseOffset = 238; a quick sketch of this parsing follows.
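As a self-contained illustration of that parsing (a sketch; the zero-padded names match how Kafka names segment files):
object BaseOffsetDemo extends App {
  val LogFileSuffix = ".log"
  // strip the suffix and parse what remains, as the loading loop above does
  def baseOffsetOf(filename: String): Long =
    filename.dropRight(LogFileSuffix.length).toLong

  println(baseOffsetOf("00000000000000000000.log")) // 0
  println(baseOffsetOf("00000000000000000238.log")) // 238
}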
Initializing nextOffsetMetadata
locally {
val startMs = time.milliseconds
loadSegments()
// def activeSegment = segments.lastEntry.getValue
// i.e. the last element of segments: the segment currently being written to
// LogSegment is analyzed later; nextOffsetMetadata records the next message offset, the active segment's baseOffset, and its log size
nextOffsetMetadata = new LogOffsetMetadata(activeSegment.nextOffset, activeSegment.baseOffset, activeSegment.size)
leaderEpochCache.clearAndFlushLatest(nextOffsetMetadata.messageOffset)
logStartOffset = math.max(logStartOffset, segments.firstEntry().getValue.baseOffset)
leaderEpochCache.clearAndFlushEarliest(logStartOffset)
loadProducerState(logEndOffset, reloadFromCleanShutdown = hasCleanShutdownFile)
}
private def loadSegments() {
....
loadSegmentFiles() // the segments initialization described above
....
}
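To make the three fields concrete, here is a simplified stand-in (a sketch, not the real LogOffsetMetadata class) populated the way the locally block would leave it after loading:
object OffsetMetadataDemo extends App {
  // simplified stand-in for LogOffsetMetadata (sketch, not the real class)
  case class OffsetMetadataSketch(
    messageOffset: Long,            // offset the next appended message will get
    segmentBaseOffset: Long,        // baseOffset of the active segment
    relativePositionInSegment: Int) // current byte size of the active segment's log

  // e.g. an active segment 238.log already holding offsets 238..249 in 1200 bytes:
  val meta = OffsetMetadataSketch(messageOffset = 250,
    segmentBaseOffset = 238, relativePositionInSegment = 1200)
  println(meta)
}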
Appending messages
private def append(records: MemoryRecords, isFromClient: Boolean, assignOffsets: Boolean, leaderEpoch: Int): LogAppendInfo = {
// analyze and validate this batch of records and return the result
val appendInfo = analyzeAndValidateRecords(records, isFromClient = isFromClient)
....
try {
lock synchronized {
if (assignOffsets) {
// take the offset from nextOffsetMetadata; as seen during initialization, it holds the offset of the next message to append
val offset = new LongRef(nextOffsetMetadata.messageOffset)
appendInfo.firstOffset = offset.value // the batch's first offset comes from nextOffsetMetadata, i.e. the log's current next offset
val now = time.milliseconds
val validateAndOffsetAssignResult = try {
LogValidator.validateMessagesAndAssignOffsets(validRecords,
offset,
now,
appendInfo.sourceCodec,
appendInfo.targetCodec,
config.compact,
config.messageFormatVersion.messageFormatVersion,
config.messageTimestampType,
config.messageTimestampDifferenceMaxMs,
leaderEpoch,
isFromClient)
} catch {
....
}
// collect the validation results
validRecords = validateAndOffsetAssignResult.validatedRecords
appendInfo.maxTimestamp = validateAndOffsetAssignResult.maxTimestamp
appendInfo.offsetOfMaxTimestamp = validateAndOffsetAssignResult.shallowOffsetOfMaxTimestamp
appendInfo.lastOffset = offset.value - 1
if (config.messageTimestampType == TimestampType.LOG_APPEND_TIME)
appendInfo.logAppendTime = now // record the append time
....
} else {
// we are taking the offsets we are given
if (!appendInfo.offsetsMonotonic || appendInfo.firstOffset < nextOffsetMetadata.messageOffset)
throw new IllegalArgumentException("Out of order offsets found in " + records.records.asScala.map(_.offset))
}
....
// check whether the active segment is full; if so, roll to a new segment
val segment = maybeRoll(messagesSize = validRecords.sizeInBytes,
maxTimestampInMessages = appendInfo.maxTimestamp,
maxOffsetInMessages = appendInfo.lastOffset)
val logOffsetMetadata = LogOffsetMetadata(
messageOffset = appendInfo.firstOffset,
segmentBaseOffset = segment.baseOffset,
relativePositionInSegment = segment.size)
// delegate the append to the segment, passing the batch's first and last
// offsets, the max timestamp and its offset, and the records themselves
segment.append(firstOffset = appendInfo.firstOffset,
largestOffset = appendInfo.lastOffset,
largestTimestamp = appendInfo.maxTimestamp,
shallowOffsetOfMaxTimestamp = appendInfo.offsetOfMaxTimestamp,
records = validRecords)
....
// nextOffsetMetadata = new LogOffsetMetadata(messageOffset, activeSegment.baseOffset, activeSegment.size)
// update nextOffsetMetadata, which records the offset of the next message to be written
updateLogEndOffset(appendInfo.lastOffset + 1)
// if the number of unflushed messages exceeds the configured threshold, flush to disk (via Segment.flush underneath)
if (unflushedMessages >= config.flushInterval)
flush()
appendInfo
}
} catch {
....
}
}
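The offset bookkeeping in the assignOffsets branch is easy to miss, so here is a toy sketch of the arithmetic (LongRefSketch is a hypothetical stand-in for the real LongRef): the validator advances the ref once per record, so offset.value - 1 ends up being the batch's last offset.
object OffsetAssignDemo extends App {
  // toy stand-in for the mutable LongRef used by the validator (sketch only)
  final class LongRefSketch(var value: Long)

  val offset = new LongRefSketch(250L)   // nextOffsetMetadata.messageOffset
  val firstOffset = offset.value         // 250, recorded in appendInfo.firstOffset
  for (_ <- 1 to 3) offset.value += 1    // validation assigns offsets 250, 251, 252
  val lastOffset = offset.value - 1      // 252, recorded in appendInfo.lastOffset
  println(s"$firstOffset..$lastOffset")  // 250..252
  // updateLogEndOffset(lastOffset + 1) then makes the next batch start at 253
}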
analyzeAndValidateRecords
The records are first analyzed, validated, and summarized; let's focus mainly on the computation part.
private def analyzeAndValidateRecords(records: MemoryRecords, isFromClient: Boolean): LogAppendInfo = {
var shallowMessageCount = 0
var validBytesCount = 0
var firstOffset = -1L
var lastOffset = -1L
var sourceCodec: CompressionCodec = NoCompressionCodec
var maxTimestamp = RecordBatch.NO_TIMESTAMP
var offsetOfMaxTimestamp = -1L
for (batch <- records.batches.asScala) {
....
lastOffset = batch.lastOffset
val batchSize = batch.sizeInBytes
batch.ensureValid()
if (batch.maxTimestamp > maxTimestamp) {
maxTimestamp = batch.maxTimestamp
offsetOfMaxTimestamp = lastOffset
}
shallowMessageCount += 1
validBytesCount += batchSize
....
}
....
LogAppendInfo(firstOffset, lastOffset, maxTimestamp, offsetOfMaxTimestamp, RecordBatch.NO_TIMESTAMP, sourceCodec,
targetCodec, shallowMessageCount, validBytesCount, monotonic)
}
The core computations above produce:
- the last (maximum) offset in the batch
- the maximum timestamp in the batch, and the offset it occurs at
- the number of (shallow) batches
- the number of valid bytes
Once done, a LogAppendInfo carrying these values is returned; a condensed sketch of the same aggregation follows.
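The same aggregation written as a small self-contained sketch (BatchSketch is a hypothetical stand-in for RecordBatch):
object AnalyzeDemo extends App {
  // hypothetical stand-in for RecordBatch (sketch only)
  final case class BatchSketch(lastOffset: Long, maxTimestamp: Long, sizeInBytes: Int)

  def analyzeSketch(batches: Seq[BatchSketch]): (Long, Long, Long, Int, Int) = {
    var lastOffset = -1L
    var maxTimestamp = -1L // plays the role of RecordBatch.NO_TIMESTAMP
    var offsetOfMaxTimestamp = -1L
    var shallowMessageCount = 0
    var validBytesCount = 0
    for (b <- batches) {
      lastOffset = b.lastOffset
      if (b.maxTimestamp > maxTimestamp) {
        maxTimestamp = b.maxTimestamp
        offsetOfMaxTimestamp = b.lastOffset
      }
      shallowMessageCount += 1
      validBytesCount += b.sizeInBytes
    }
    (lastOffset, maxTimestamp, offsetOfMaxTimestamp, shallowMessageCount, validBytesCount)
  }

  println(analyzeSketch(Seq(BatchSketch(2, 1000L, 64), BatchSketch(5, 3000L, 96))))
  // (5,3000,5,2,160)
}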
maybeRoll
private def maybeRoll(messagesSize: Int, maxTimestampInMessages: Long, maxOffsetInMessages: Long): LogSegment = {
val segment = activeSegment
val now = time.milliseconds
val reachedRollMs = segment.timeWaitedForRoll(now, maxTimestampInMessages) > config.segmentMs - segment.rollJitterMs
if (segment.size > config.segmentSize - messagesSize ||
(segment.size > 0 && reachedRollMs) ||
segment.index.isFull || segment.timeIndex.isFull || !segment.canConvertToRelativeOffset(maxOffsetInMessages)) {
// why maxOffsetInMessages - Integer.MAX_VALUE? see the note below
roll(maxOffsetInMessages - Integer.MAX_VALUE)
} else {
segment
}
}
Several conditions trigger a roll to a new segment:
- the segment log no longer has room for this batch of messages
- the segment is non-empty and has been active longer than config.segmentMs (minus the roll jitter)
- the offset index is full
- the time index is full
- the difference between the batch's max offset and the segment's baseOffset exceeds Int.MaxValue (!canConvertToRelativeOffset)
If none of these hold, the current activeSegment is returned as-is; otherwise roll does the actual work. As for the maxOffsetInMessages - Integer.MAX_VALUE question in the code: offsets inside a segment are stored relative to its baseOffset and must fit in an Int, so this value is a conservative lower bound for the batch's first offset, which guarantees every message in the batch can still be addressed with a relative offset in the new segment. The conditions can be condensed into one predicate, shown in the sketch below, after which we look at roll itself.
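A condensed sketch of the decision with hypothetical parameter names, mirroring the checks in maybeRoll:
object ShouldRollDemo extends App {
  def shouldRoll(segmentSize: Int, maxSegmentSize: Int, messagesSize: Int,
                 timeWaitedMs: Long, segmentMs: Long,
                 indexFull: Boolean, timeIndexFull: Boolean,
                 maxOffsetInMessages: Long, baseOffset: Long): Boolean =
    segmentSize > maxSegmentSize - messagesSize ||     // no room left for this batch
      (segmentSize > 0 && timeWaitedMs > segmentMs) || // time-based roll
      indexFull || timeIndexFull ||                    // either index is full
      maxOffsetInMessages - baseOffset > Int.MaxValue  // relative offset would overflow

  // a 1 GB segment with 900 MB used cannot take a 200 MB batch:
  println(shouldRoll(900000000, 1073741824, 200000000, 0L, Long.MaxValue,
    indexFull = false, timeIndexFull = false,
    maxOffsetInMessages = 1000L, baseOffset = 0L)) // true
}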
roll
def roll(expectedNextOffset: Long = 0): LogSegment = {
val start = time.nanoseconds
lock synchronized {
val newOffset = math.max(expectedNextOffset, logEndOffset)
// resolve the files for the new segment; delete any that already exist
val logFile = Log.logFile(dir, newOffset)
val offsetIdxFile = offsetIndexFile(dir, newOffset)
val timeIdxFile = timeIndexFile(dir, newOffset)
val txnIdxFile = transactionIndexFile(dir, newOffset)
for (file <- List(logFile, offsetIdxFile, timeIdxFile, txnIdxFile) if file.exists) {
file.delete()
}
segments.lastEntry() match {
case null =>
case entry => {
val seg = entry.getValue
seg.onBecomeInactiveSegment() // write the max timestamp and offset into the time index
seg.index.trimToValidSize() // trim the index file to its valid size
seg.timeIndex.trimToValidSize() // trim the time index file to its valid size
seg.log.trim() // trim the log file
}
}
....
// construct a new segment from the new offset and the config
val segment = new LogSegment(dir,
startOffset = newOffset,
indexIntervalBytes = config.indexInterval,
maxIndexSize = config.maxIndexSize,
rollJitterMs = config.randomSegmentJitter,
time = time,
fileAlreadyExists = false,
initFileSize = initFileSize,
preallocate = config.preallocate)
val prev = addSegment(segment) // add it to segments
....
// update nextOffsetMetadata
updateLogEndOffset(nextOffsetMetadata.messageOffset)
....
segment
}
}
In the earlier example, 238 is exactly this logEndOffset (the expectedNextOffset passed in).
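For reference, the segment file names come from zero-padding the base offset to 20 digits; a one-method sketch of that naming:
object SegmentNameDemo extends App {
  // sketch of the naming scheme: base offset zero-padded to 20 digits
  def logFileName(offset: Long): String = f"$offset%020d.log"

  println(logFileName(0L))   // 00000000000000000000.log
  println(logFileName(238L)) // 00000000000000000238.log
}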
Reading messages
Reads go through the read function:
def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None, minOneMessage: Boolean = false,
isolationLevel: IsolationLevel): FetchDataInfo = {
val currentNextOffsetMetadata = nextOffsetMetadata
val next = currentNextOffsetMetadata.messageOffset
if (startOffset == next) {
val abortedTransactions =
if (isolationLevel == IsolationLevel.READ_COMMITTED) Some(List.empty[AbortedTransaction])
else None
return FetchDataInfo(currentNextOffsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false,
abortedTransactions = abortedTransactions)
}
// find the entry with the greatest baseOffset <= startOffset
var segmentEntry = segments.floorEntry(startOffset)
while (segmentEntry != null) {
val segment = segmentEntry.getValue
// the max read position splits into two cases:
// 1. reading from the active segment: nextOffsetMetadata.relativePositionInSegment
// 2. reading from any other segment: the segment's size
val maxPosition = {
if (segmentEntry == segments.lastEntry) { // reading from the active segment
// nextOffsetMetadata keeps this field equal to the active segment's current size, i.e. the max readable position
val exposedPos = nextOffsetMetadata.relativePositionInSegment.toLong
// re-check in case a roll happened in between; then the segment's size is the max position
if (segmentEntry != segments.lastEntry)
segment.size
else
exposedPos
} else {
segment.size
}
}
// read the messages from the segment
val fetchInfo = segment.read(startOffset, maxOffset, maxLength, maxPosition, minOneMessage)
if (fetchInfo == null) {
segmentEntry = segments.higherEntry(segmentEntry.getKey)
} else {
return isolationLevel match {
case IsolationLevel.READ_UNCOMMITTED => fetchInfo
case IsolationLevel.READ_COMMITTED => addAbortedTransactions(startOffset, segmentEntry, fetchInfo)
}
}
}
FetchDataInfo(nextOffsetMetadata, MemoryRecords.EMPTY)
}
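The floorEntry/higherEntry scan is the pattern worth remembering; here is a stripped-down sketch (hypothetical value type, max-position logic omitted):
import java.util.concurrent.ConcurrentSkipListMap

object ReadScanDemo extends App {
  // walk forward from the floor segment until one yields data
  def firstFetchable(segments: ConcurrentSkipListMap[java.lang.Long, Vector[String]],
                     startOffset: Long): Option[Vector[String]] = {
    var entry = segments.floorEntry(startOffset)
    while (entry != null) {
      if (entry.getValue.nonEmpty) return Some(entry.getValue)
      // startOffset may sit in a gap (e.g. after compaction); try the next segment
      entry = segments.higherEntry(entry.getKey)
    }
    None
  }

  val segments = new ConcurrentSkipListMap[java.lang.Long, Vector[String]]
  segments.put(0L, Vector.empty)          // fully compacted away
  segments.put(238L, Vector("m238"))
  println(firstFetchable(segments, 100L)) // Some(Vector(m238))
}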
Flushing messages to disk
When a message is appended it has not actually reached the disk yet (RocketMQ behaves similarly): the data sits in the OS page cache until it is flushed. In Log, flushing goes through the flush function, which takes an offset and flushes all messages before that offset to disk.
def flush(): Unit = flush(this.logEndOffset)
def flush(offset: Long): Unit = {
if (offset <= this.recoveryPoint)
return
// the actual flushing is delegated to each affected Segment
for (segment <- logSegments(this.recoveryPoint, offset))
segment.flush()
producerStateManager.deleteSnapshotsBefore(minSnapshotOffsetToRetain(offset))
lock synchronized {
if (offset > this.recoveryPoint) {
this.recoveryPoint = offset
lastflushedTime.set(time.milliseconds)
}
}
}
recoveryPoint: the first offset that has not yet been flushed; in other words, every offset below recoveryPoint is already on disk. That is why the function returns immediately when the target offset is <= recoveryPoint: that range has already been flushed.
As introduced at the start, Log is the externally facing abstraction over many Segments, so when a caller invokes Log's flush, the work is necessarily done per Segment: the affected Segments are located first and then flushed one by one. The function that locates them is logSegments:
def logSegments(from: Long, to: Long): Iterable[LogSegment] = {
lock synchronized {
val floor = segments.floorKey(from)
if (floor eq null)
segments.headMap(to).values.asScala
else
segments.subMap(floor, true, to, false).values.asScala
}
}
This function returns the segments covering offsets from 'from' through 'to - 1', or up to the end of the log if to > logEndOffset.
It leans on ConcurrentSkipListMap, which can cheaply return the submap of entries between two keys, as demonstrated below.
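A quick demonstration (toy values) of the ConcurrentSkipListMap calls logSegments relies on:
import java.util.concurrent.ConcurrentSkipListMap
import scala.collection.JavaConverters._

object LogSegmentsDemo extends App {
  val segs = new ConcurrentSkipListMap[java.lang.Long, String]
  segs.put(0L, "seg-0"); segs.put(238L, "seg-238"); segs.put(512L, "seg-512")

  // from = 100 falls inside seg-0; to = 300 excludes seg-512
  val floor = segs.floorKey(100L)
  println(segs.subMap(floor, true, 300L, false).values.asScala.toList)
  // List(seg-0, seg-238)
}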
That concludes the analysis of the functions behind Log's main operations; readers interested in the rest can explore the code on their own.
A small aside: this post took several days to write =_=