Linux Performance Optimization: High Disk I/O Latency

Contents

Setting up the environment

Analyzing the problem


Setting up the environment

Install bcc and Docker, then start Docker:

service docker start


The runtime environment is as follows. The container holds three files:

io_app.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import os
import uuid
import random
import shutil
from concurrent.futures import ThreadPoolExecutor
from flask import Flask, jsonify

app = Flask(__name__)


def validate(word, sentence):
    return word in sentence


def generate_article():
    s_nouns = [
        "A dude", "My mom", "The king", "Some guy", "A cat with rabies",
        "A sloth", "Your homie", "This cool guy my gardener met yesterday",
        "Superman"
    ]
    p_nouns = [
        "These dudes", "Both of my moms", "All the kings of the world",
        "Some guys", "All of a cattery's cats",
        "The multitude of sloths living under your bed", "Your homies",
        "Like, these, like, all these people", "Supermen"
    ]
    s_verbs = [
        "eats", "kicks", "gives", "treats", "meets with", "creates", "hacks",
        "configures", "spies on", "retards", "meows on", "flees from",
        "tries to automate", "explodes"
    ]
    infinitives = [
        "to make a pie.", "for no apparent reason.",
        "because the sky is green.", "for a disease.",
        "to be able to make toast explode.", "to know more about archeology."
    ]
    # note: the `or` never falls through to p_nouns, because random.choice
    # on a non-empty list always returns a truthy string
    sentence = '{} {} {} {}'.format(
        random.choice(s_nouns), random.choice(s_verbs),
        random.choice(s_nouns).lower() or random.choice(p_nouns).lower(),
        random.choice(infinitives))
    return '\n'.join([sentence for i in range(50000)])


@app.route('/')
def hello_world():
    return 'hello world'


@app.route("/popularity/<word>")
def word_popularity(word):
    dir_path = '/tmp/{}'.format(uuid.uuid1())
    count = 0
    sample_size = 1000

    def save_to_file(file_name, content):
        with open(file_name, 'w') as f:
            f.write(content)

    try:
        # create the working directory first
        os.mkdir(dir_path)

        # save article to files
        for i in range(sample_size):
            file_name = '{}/{}.txt'.format(dir_path, i)
            article = generate_article()
            save_to_file(file_name, article)

        # count word popularity
        for root, dirs, files in os.walk(dir_path):
            for file_name in files:
                with open('{}/{}'.format(dir_path, file_name)) as f:
                    if validate(word, f.read()):
                        count += 1
    finally:
        # clean files
        shutil.rmtree(dir_path, ignore_errors=True)

    return jsonify({'popularity': count / sample_size * 100, 'word': word})


@app.route("/popular/<word>")
def word_popular(word):
    count = 0
    sample_size = 1000
    articles = []

    try:
        for i in range(sample_size):
            articles.append(generate_article())

        for article in articles:
            if validate(word, article):
                count += 1
    finally:
        pass

    return jsonify({'popularity': count / sample_size * 100, 'word': word})


if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=80)

The Dockerfile:

FROM python:alpine

LABEL maintainer="[email protected]"

RUN pip install flask

EXPOSE 80
ADD io_app.py /io_app.py

The Makefile:

.PHONY: run
run:
        docker run --name=io_app -p 10000:80 -itd feisky/word-pop


.PHONY: build
build:
        docker build -t feisky/word-pop -f Dockerfile .


.PHONY: push
push:
        docker push feisky/word-pop


.PHONY: clean
clean:
        docker rm -f io_app

Some preparation before running:

# build the Docker image
make build

# run the case app
make run

# check that the container is running
docker ps
CONTAINER ID        IMAGE               COMMAND               CREATED             STATUS              PORTS                   NAMES
88303172b050        feisky/word-pop     "python /io_app.py"   1 hours ago         Up 1 hours          0.0.0.0:10000->80/tcp   io_app


Test that the service responds:
curl http://[IP]:10000/
hello world

Analyzing the problem

To keep the workload running continuously rather than finishing in an instant, put the call in a loop:

while true; do time curl http://[IP]:10000/popularity/word; sleep 1; done

# a single request returns:
{
  "popularity": 0.0, 
  "word": "word"
}

Observing with top, iowait is quite high:

top

top - 17:53:37 up 12 days,  7:36,  6 users,  load average: 0.65, 0.16, 0.05
Tasks:  90 total,   2 running,  53 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.1 us, 21.1 sy,  0.0 ni,  0.0 id, 58.7 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1008936 total,    73040 free,   131788 used,   804108 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   718136 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                             
27110 root      20   0  103304  21636   2600 S 38.2  2.1   1:29.64 python                                                                                              
   34 root      20   0       0      0      0 S  2.0  0.0   0:01.79 kswapd0                                                                                             
   10 root      20   0       0      0      0 R  0.3  0.0   0:51.73 rcu_sched                                                                                           

The iostat output shows disk utilization (%util) has already hit 100%, and write request latency (w_await) is around 1 second:

iostat -x -d 1
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    14.74    0.00  111.58     0.00 116517.89  2088.53   122.98 1064.30    0.00 1064.30   9.42 105.05

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    30.30    0.00  111.11     0.00 111385.86  2004.95    78.76 1078.08    0.00 1078.08   9.05 100.51

pidstat shows the python process is responsible for the writes:

pidstat -d 1
06:06:30 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
06:06:31 PM     0       349      0.00    149.49      0.00  jbd2/vda1-8
06:06:31 PM     0     27110      0.00   5886.87      0.00  python

06:06:31 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
06:06:32 PM     0     27110      0.00 152249.50      0.00  python

strace shows nothing but a pile of stat() calls:

strace -p 27110
stat("/usr/local/lib/python3.7/site-packages/itsdangerous/serializer.py", {st_mode=S_IFREG|0644, st_size=8653, ...}) = 0
...
stat("/usr/local/lib/python3.7/site-packages/itsdangerous/signer.py", {st_mode=S_IFREG|0644, st_size=6345, ...}) = 0
...
stat("/usr/local/lib/python3.7/site-packages/itsdangerous/timed.py", {st_mode=S_IFREG|0644, st_size=5635, ...}) = 0

Observe again with strace, this time also following child threads (-ff) and narrowing the scope to file-related system calls.
With -e trace=open it becomes obvious: the python process keeps creating temporary files.

strace -p 27110 -ff -e trace=desc
[pid 27651] ioctl(6, TIOCGWINSZ, 0x7fd244c6eef0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 27651] lseek(6, 0, SEEK_CUR)       = 0
[pid 27651] mmap(NULL, 4202496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd243fd6000
[pid 27651] write(6, "A cat with rabies meets with thi"..., 4199999) = 4199999
[pid 27651] close(6)                    = 0
[pid 27651] mmap(NULL, 253952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd244af1000
[pid 27651] mmap(NULL, 3153920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd2440d6000
[pid 27651] open("/tmp/e6bc7e84-1d65-11e9-b8e3-0242ac120002/446.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27651] fcntl(6, F_SETFD, FD_CLOEXEC) = 0


strace -p 27110 -ff -e trace=open
strace: Process 27110 attached with 3 threads
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/245.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/246.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/247.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/248.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/249.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/250.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6
[pid 27669] open("/tmp/bd006d14-1d68-11e9-b8e3-0242ac120002/251.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 6

Observe with the bcc tools:


filetop -C 
TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
2079   AliYunDun        4      0      31      0       R stat
27679  filetop          2      0      2       0       R loadavg
27681  python           0      1      0       3173    R 362.txt
27681  python           0      1      0       2978    R 359.txt
27681  python           0      1      0       2343    R 363.txt
27681  python           0      1      0       2929    R 361.txt
27681  python           0      1      0       2685    R 356.txt
27681  python           0      1      0       2734    R 355.txt
27681  python           0      1      0       2490    R 360.txt
27681  python           0      1      0       3759    R 358.txt
27681  python           0      1      0       3124    R 357.txt

18:44:53 loadavg: 0.45 0.25 0.21 6/156 27681

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
2079   AliYunDun        4      0      31      0       R stat
27679  filetop          2      0      2       0       R loadavg
27681  python           0      1      0       1757    R 402.txt
27681  python           0      1      0       3027    R 373.txt
27681  python           0      1      0       3076    R 404.txt
27681  python           0      1      0       2685    R 414.txt
27681  python           0      1      0       3955    R 392.txt
27681  python           0      1      0       2539    R 388.txt
27681  python           0      1      0       2490    R 403.txt
27681  python           0      1      0       3271    R 396.txt
27681  python           0      1      0       2539    R 397.txt
27681  python           0      1      0       2880    R 368.txt
27681  python           0      1      0       2587    R 367.txt


# find the process that thread 27681 belongs to
ps -efT | grep 27681
root     27110 27681 27090 35 18:44 pts/2    00:00:17 /usr/local/bin/python /io_app.py
root     27683 27683 27464  0 18:45 pts/5    00:00:00 grep --color=auto 27681



opensnoop
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/245.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/246.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/247.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/248.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/249.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/250.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/251.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/252.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/253.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/254.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/255.txt
27110  python              6   0 /tmp/c4411dc0-1d69-11e9-b8e3-0242ac120002/256.txt

As the source code shows, for every request the case app generates a batch of temporary files, reads them back into memory for processing, and finally deletes the whole directory.
Staging large amounts of data on disk like this is a common technique, but in this case the I/O requests are too heavy, driving disk utilization through the roof.
The fix is essentially an algorithmic optimization: when memory is sufficient, keep all the data in memory and avoid the disk I/O entirely.
Of course, that is only a first step and not a complete solution; further optimization is possible. In real systems, though, this is usually how it goes:
use the simplest method first to get the production issue fixed quickly, then keep thinking about better optimizations.
Switching to the in-memory approach:

time curl http://[IP]:10000/popular/word
curl: (52) Empty reply from server

real    0m29.176s
user    0m0.002s
sys     0m0.035s
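The empty reply suggests the in-memory variant runs out of memory: it materializes all 1000 articles (each around 4 MB, judging from the write sizes in the strace output above) before checking any of them. A memory-friendlier variant checks each article as soon as it is generated. This is only a sketch of that idea with a stand-in `generate_article`; the names here are hypothetical, not the case app's actual code:

```python
import random


def generate_article():
    # stand-in for the case app's generator: a random sentence
    # repeated 50000 times, roughly matching the real article size
    sentence = random.choice(["A dude eats a cat for no apparent reason."])
    return '\n'.join(sentence for _ in range(50000))


def word_popularity_streaming(word, sample_size=1000):
    count = 0
    for _ in range(sample_size):
        article = generate_article()   # only one article in memory at a time
        if word in article:
            count += 1
        # `article` is garbage-collectable before the next iteration,
        # so peak memory stays near one article instead of sample_size
    return count / sample_size * 100


print(word_popularity_streaming('dude', sample_size=10))  # → 100.0
```

The design trade-off is the same as the original /popular route's, but peak memory drops from ~4 GB to a few megabytes because articles are never accumulated in a list.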

This is a case of a word-popularity service responding too slowly.
First, top and iostat were used to analyze the system's CPU and disk usage. They revealed a disk I/O bottleneck and showed that the case app was causing it.
Next, strace was used to observe the process's system calls, but with no luck at first: no write system calls showed up.
Re-running with strace -ff -e trace=open made the root cause obvious.
The filetop and opensnoop tools from the bcc dynamic-tracing toolkit confirmed it: heavy reading and writing of temporary files.
Once the problem is found, the optimization is fairly straightforward. When memory is sufficient, the simplest fix is to keep all the data in faster memory, removing the disk I/O bottleneck entirely. Going a step further, algorithms such as a Trie can make the word processing itself even more efficient.
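As a sketch of the Trie idea just mentioned (a hypothetical illustration, not part of the case app): index an article's words once, and each subsequent lookup costs O(length of the word), independent of the article's size.

```python
# Minimal word-lookup Trie built from nested dicts.
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node  # prefixes without the marker don't count


# index the article once, then query as many words as needed
trie = Trie()
article = "A dude eats a cat for no apparent reason"
for w in article.lower().split():
    trie.insert(w)

print(trie.contains('dude'))   # → True
print(trie.contains('sloth'))  # → False
```

Compared with `word in article`, which rescans the whole text per query, this pays the scan cost once at indexing time, which helps when many words are checked against the same articles.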


Reprinted from blog.csdn.net/hixiaoxiaoniao/article/details/86580128