How Python's linecache reads a file

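While writing a log-processing script in Python recently, I experimented with several ways of reading files. The behavior of linecache in particular caught my attention.
The classic ways to read a file line by line in Python are: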

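with open('blabla.log', 'r') as f:
    for line in f.readlines():  # loads every line into a list first
        pass  # do something with line

with open('blabla.log', 'r') as f:
    for line in f:  # iterates lazily, one line at a time
        pass  # do something with line

with open('blabla.log', 'r') as f:
    while True:
        line = f.readline()
        if not line:  # an empty string means end of file
            break
        pass  # do something with line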

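None of these approaches supports random access to a file by line number. Against that background, the linecache module, which can fetch a specific line directly, is a useful complement.
The module's getline function returns the content of a given line. The official documentation gives this usage: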

>>> import linecache
>>> linecache.getline('/etc/passwd', 4)
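To make the call above easier to try on any machine, here is a minimal sketch that does not depend on /etc/passwd. The temporary file and its four lines are made up for illustration; only linecache.getline and linecache.clearcache come from the module itself.

```python
import linecache
import os
import tempfile

# Write a small sample file (hypothetical content, for illustration only).
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write('first\nsecond\nthird\nfourth\n')
    sample_path = tmp.name

# Line numbers are 1-based; the trailing newline is part of the result.
fourth_line = linecache.getline(sample_path, 4)

# An out-of-range line number returns an empty string instead of raising.
missing_line = linecache.getline(sample_path, 99)

print(repr(fourth_line))
print(repr(missing_line))

# Drop the cached copy and remove the sample file.
linecache.clearcache()
os.remove(sample_path)
```

Note that getline never raises for a bad line number or an unreadable file, so an empty string is the only signal that something went wrong.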

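While using it, I noticed that log analysis built on linecache.getline consumes a large amount of memory well before it saturates the CPU: memory is used up rapidly while CPU utilization is still low.
This made me curious. My guess was that, for random access, linecache first reads the file into memory sequentially and then checks whether the requested line is in memory. If it is not, it replaces the cached content until the line is found, and then returns it.
Reading the linecache source confirmed the main point of this guess: the file is read into memory before any line is returned.
In linecache.py (the Python 2 version is quoted here), getline is defined as: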

def getline(filename, lineno, module_globals=None):
    lines = getlines(filename, module_globals)
    if 1 <= lineno <= len(lines):
        return lines[lineno-1]
    else:
        return ''

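As the code shows, getline obtains the list of the file's lines from getlines, and that list is what makes random access by line number possible. Next, the definition of getlines: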

def getlines(filename, module_globals=None):
    """Get the lines for a file from the cache.
    Update the cache if it doesn't contain an entry for this file already."""

    if filename in cache:
        return cache[filename][2]
    else:
        return updatecache(filename, module_globals)

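So getlines first checks whether the file is already in the cache. If it is, it returns the cached list of lines; otherwise it calls updatecache to refresh the cache. As a result, the first access to a file forces linecache to load it into memory before any line can be returned.
The updatecache function reveals an interesting fact: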
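The caching behavior can be observed directly. The sketch below is only an illustration: the 1000-line temporary file is made up, and linecache.cache is the module-level dictionary seen in the source above, which is an implementation detail rather than a documented interface.

```python
import linecache
import os
import tempfile

# Create a hypothetical 1000-line sample file.
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.writelines('line %d\n' % i for i in range(1, 1001))
    sample_path = tmp.name

linecache.clearcache()

# Ask for a single line only.
first = linecache.getline(sample_path, 1)

# The whole file is now held in memory, not just the line we asked for.
in_cache = sample_path in linecache.cache
cached_count = len(linecache.getlines(sample_path))

print(in_cache, cached_count)

# clearcache() releases the cached lines.
linecache.clearcache()
in_cache_after_clear = sample_path in linecache.cache

os.remove(sample_path)
```

For a long-running script that touches many large files, calling linecache.clearcache() between files is one way to keep the cache from growing without bound.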

def updatecache(filename, module_globals=None):
    """Update a cache entry and return its list of lines.
    If something's wrong, print a message, discard the cache entry,
    and return an empty list."""

    ## ... omitted ...

    try:
        fp = open(fullname, 'rU')
        lines = fp.readlines()
        fp.close()
    except IOError, msg:
##      print '*** Cannot open', fullname, ':', msg
        return []
    if lines and not lines[-1].endswith('\n'):
        lines[-1] += '\n'
    size, mtime = stat.st_size, stat.st_mtime
    cache[filename] = size, mtime, lines, fullname
    return lines

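In other words, linecache still relies on the file object's readlines method. This is also a hint: when a file is too large to load as a list of lines with readlines, linecache is unlikely to be a good choice either.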
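For files that are too large to cache, one possible alternative is to keep only the byte offset of each line in memory and seek to it on demand. This is a sketch of that idea, not something linecache provides: build_line_index and read_line are hypothetical helper names, the sample file is made up, and UTF-8 encoding is assumed.

```python
import os
import tempfile

def build_line_index(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    position = 0
    with open(path, 'rb') as f:  # binary mode keeps offsets exact
        for raw_line in f:
            offsets.append(position)
            position += len(raw_line)
    return offsets

def read_line(path, offsets, lineno):
    """Return line `lineno` (1-based), or '' if out of range, like getline."""
    if not 1 <= lineno <= len(offsets):
        return ''
    with open(path, 'rb') as f:
        f.seek(offsets[lineno - 1])
        return f.readline().decode('utf-8')  # assumes UTF-8 content

# Hypothetical sample file for illustration.
with tempfile.NamedTemporaryFile('wb', suffix='.log', delete=False) as tmp:
    tmp.write(b'alpha\nbeta\ngamma\n')
    sample_path = tmp.name

offsets = build_line_index(sample_path)
third = read_line(sample_path, offsets, 3)
beyond = read_line(sample_path, offsets, 4)
print(offsets, repr(third), repr(beyond))

os.remove(sample_path)
```

The index still costs one integer per line, and the file must be scanned once to build it, but the line contents themselves stay on disk. Unlike getline, this version does not append a newline to a final line that lacks one.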
