Problems encountered when deploying OpenStack Rocky AIO, traced back to the source code

Environment

Ubuntu 18.04.2
Python 2.7

Problem 1

I followed the official guide step by step:
https://docs.openstack.org/openstack-ansible/latest/user/aio/quickstart.html

But things went wrong at the step # scripts/bootstrap-aio.sh
The error message was:
TASK [bootstrap-host : Fail if there is not enough space available in /] ******* fatal: [localhost]: FAILED! => {"changed": false, "msg": "Not enough space available in /.\nFound 0.0 GB, required 50 GB)\n"}

It actually claims there is not enough disk space.

Checking the space available on / (df output, on a system with a Chinese locale):

文件系统            1K-块     已用       可用 已用% 挂载点
udev             32816228        0   32816228    0% /dev
tmpfs             6567676     2368    6565308    1% /run
/dev/nvme0n1p3  490140632 18205492  446967616    4% /
tmpfs            32838364    86376   32751988    1% /dev/shm
tmpfs                5120        4       5116    1% /run/lock
tmpfs            32838364        0   32838364    0% /sys/fs/cgroup
/dev/loop0          35456    35456          0  100% /snap/gtk-common-themes/818
/dev/loop1           4224     4224          0  100% /snap/gnome-calculator/352
/dev/loop2          14976    14976          0  100% /snap/gnome-logs/45
/dev/loop3           3840     3840          0  100% /snap/gnome-system-monitor/70
/dev/loop4          93184    93184          0  100% /snap/core/6350
/dev/loop6          15104    15104          0  100% /snap/gnome-characters/206
/dev/loop7           3840     3840          0  100% /snap/gnome-system-monitor/57
/dev/loop5           1024     1024          0  100% /snap/gnome-logs/57
/dev/loop8          36224    36224          0  100% /snap/gtk-common-themes/1198
/dev/loop9           2304     2304          0  100% /snap/gnome-calculator/260
/dev/loop10         91392    91392          0  100% /snap/core/6673
/dev/nvme0n1p2     594784     6248     588536    2% /boot/efi
/dev/loop12         55040    55040          0  100% /snap/core18/782
/dev/loop11        144128   144128          0  100% /snap/gnome-3-26-1604/82
/dev/loop13        144128   144128          0  100% /snap/gnome-3-26-1604/74
/dev/loop14         13312    13312          0  100% /snap/gnome-characters/139
tmpfs             6567672       16    6567656    1% /run/user/121
tmpfs             6567672       44    6567628    1% /run/user/1000
/dev/sda1       960568328    77888  911626492    1% /usr/local/openstack
/dev/sda2       479567536  3323764  451813336    1% /media/rise/7ab2a651-a610-4ccf-b1db-20030e67fe18
/dev/sda3       479567536    73764  455063336    1% /media/rise/c3ded1c0-8688-485f-97ff-96fcb8c79fb1
/dev/loop15        126976   126976          0  100% /snap/vscode/89
/dev/loop17    1073610752  1105952 1072504800    1% /var/lib/nova/instances
/dev/loop18    1073610752  1105952 1072504800    1% /srv/swift1.img
/dev/loop19    1073610752  1105952 1072504800    1% /srv/swift2.img
/dev/loop20    1073610752  1105952 1072504800    1% /srv/swift3.img
/dev/loop21     489684992    16672  487570432    1% /var/lib/machines
tmpfs             6567672        0    6567672    0% /run/user/0

Nothing wrong at all! / has well over 400 GB available.

After half a day of digging, I found something interesting in /opt/openstack-ansible/tests/roles/bootstrap-host/tasks/check-requirements.yml.

- name: Fail if there is not enough space available in /
  fail:
    msg: |
      Not enough space available in /.
      Found {{ root_gb_available }} GB, required {{ bootstrap_host_data_disk_min_size }} GB)
  when:
    - bootstrap_host_data_disk_device == None
    - (host_root_space_available_bytes | int) < (host_data_disk_min_size_bytes | int)
  tags:
    - check-disk-size

As the task above shows, the direct cause is that host_root_space_available_bytes is 0. The source below shows that this variable in turn depends on root_space_available.

# Convert root_space_available to bytes.
- name: Set root disk facts
  set_fact:
    host_root_space_available_bytes: "{{ ( root_space_available.stdout | int) * 1024 | int }}"
  when:
    - bootstrap_host_data_disk_device == None
  tags:
    - check-disk-size

root_space_available itself comes from a register. See the source below:

- name: Identify the space available in /
  # NOTE(hwoarang): df does not work reliably on btrfs filesystems
  # https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
  # As such, use the btrfs tools to determine the real available size on the
  # disk
  shell: |
    if [[ $(df -T / | tail -n 1 | awk '{print $2}') == "btrfs" ]]; then
        btrfs fi usage --kbytes / | awk '/^.*Free / {print $3}'| sed 's/\..*//'
    else
        df -BK / | awk '!/^Filesystem/ {print $4}' | sed 's/K//' 
    fi
  when:
    - bootstrap_host_data_disk_device == None
  changed_when: false
  register: root_space_available
  tags:
    - check-disk-size

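My environment does not use a btrfs filesystem, so the shell snippet in that task takes the else branch. The key point is that my system locale is Chinese: the df header line starts with 文件系统 instead of Filesystem, so the awk pattern !/^Filesystem/ does not filter it out, and the else branch outputs: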

# df -BK / | awk '!/^Filesystem/ {print $4}' | sed 's/K//' 
可用
446940080
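
To see why this two-line output ends up as "Found 0.0 GB", it helps to follow it through the `| int` filter in the set_fact task. The sketch below is a plain-Python approximation of how Jinja2's int filter behaves (it falls back to a default of 0 when the value cannot be converted). The names jinja_int and available_bytes are hypothetical helpers for illustration only and are not part of openstack-ansible.

```python
def jinja_int(value, default=0):
    """Approximate Jinja2's `int` filter: return `default` if conversion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        try:
            return int(float(value))
        except (TypeError, ValueError):
            return default


def available_bytes(df_stdout):
    """Mirror the set_fact expression: (stdout | int) * 1024."""
    return jinja_int(df_stdout) * 1024


# stdout as registered on the Chinese-locale host: header line plus the value.
localized_stdout = "可用\n446940080"
# stdout once only the value line is kept.
value_only_stdout = "446940080"

print(available_bytes(localized_stdout))   # 0
print(available_bytes(value_only_stdout))  # 457666641920
```

Because the whole string is not a valid integer, the filter silently yields 0 instead of raising an error, which is why the failure shows up only later as a misleading disk-space message.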

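Unbelievable! The output contains the Chinese header plus a line break, so it no longer parses as a number. The fix is simple:
change that line so that only the second line of output is kept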
df -BK / | awk '!/^Filesystem/ {print $4}' | sed 's/K//' | sed -n "2p"

OK!
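
The underlying idea of the fix is to select the value by its position in the output rather than by the wording of the header, which changes with the locale. A minimal Python sketch of that idea follows. The function name parse_available_kb is hypothetical, and the sample text is only modeled on the df listing in this post (the exact header wording printed by df -BK is an assumption).

```python
def parse_available_kb(df_output):
    """Return the 'available' column (4th field) of the last df line, in KiB.

    The header is skipped by position, so its language does not matter.
    """
    lines = [line for line in df_output.splitlines() if line.strip()]
    data_line = lines[-1]              # df on a single path prints header + one data line
    field = data_line.split()[3]       # 4th column: available space, e.g. "446940080K"
    return int(field.rstrip("K"))


# Illustrative sample shaped like `df -BK /` on a Chinese-locale system.
SAMPLE_ZH = (
    "文件系统          1K-块      已用      可用 已用% 挂载点\n"
    "/dev/nvme0n1p3 490140632K 18205492K 446940080K    4% /\n"
)

print(parse_available_kb(SAMPLE_ZH))  # 446940080
```

Forcing the C locale for the command (LC_ALL=C df -BK /) is another commonly used way to get the English Filesystem header back so the original awk pattern matches again. That is not what was done here.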

Problem 2

TASK [lxc_hosts : Ensure that the LXC cache has been prepared] **************************************************************************
task path: /etc/ansible/roles/lxc_hosts/tasks/lxc_cache_preparation.yml:137
<aio1> The "physical_host" variable of "aio1" has been found to have a corresponding host entry in inventory.
<aio1> The "physical_host" variable of "aio1" terminates at "172.29.236.100" using the host variable "ansible_host".
FAILED - RETRYING: Ensure that the LXC cache has been prepared (120 retries left).
FAILED - RETRYING: Ensure that the LXC cache has been prepared (119 retries left).
FAILED - RETRYING: Ensure that the LXC cache has been prepared (118 retries left).
FAILED - RETRYING: Ensure that the LXC cache has been prepared (117 retries left).
FAILED - RETRYING: Ensure that the LXC cache has been prepared (116 retries left).
FAILED - RETRYING: Ensure that the LXC cache has been prepared (115 retries left).
FAILED - RETRYING: Ensure that the LXC cache has been prepared (114 retries left).
fatal: [aio1]: FAILED! => {"ansible_job_id": "616180643313.20921", "attempts": 8, "changed": true, "cmd": "chroot /var/lib/machines/ubuntu-bionic-amd64 /opt/cache-prep-commands.sh > /var/log/lxc-cache-prep-commands.log 2>&1", "delta": "0:01:09.560041", "end": "2019-04-04 10:55:54.090647", "finished": 1, "msg": "non-zero return code", "rc": 100, "start": "2019-04-04 10:54:44.530606", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

This is only an excerpt of the log. The output shows that the error occurred at /etc/ansible/roles/lxc_hosts/tasks/lxc_cache_preparation.yml:137. The source there looks like this:

- name: Ensure that the LXC cache has been prepared
  async_status:
    jid: "{{ _lxc_cache_prepare_commands.ansible_job_id }}"
  register: _lxc_cache_prepare_commands_result
  until: _lxc_cache_prepare_commands_result.finished
  delay: 10
  retries: "{{ lxc_cache_prep_timeout | int // 10 }}"

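My preliminary conclusion was that the async job was timing out while being polled. I set delay to 20, ran it again, and the problem went away. Note that the log also reports rc 100 from /opt/cache-prep-commands.sh itself, so if the failure comes back, /var/log/lxc-cache-prep-commands.log is the place to look.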
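
For reference, the total time the task is willing to keep polling is roughly retries times delay, so doubling delay doubles that window. The sketch below works through the arithmetic. The function name polling_window_seconds is hypothetical, and the value 1200 for lxc_cache_prep_timeout is an assumption inferred from the "120 retries left" line in the log, not taken from the role's defaults.

```python
def polling_window_seconds(timeout_seconds, delay_seconds, step=10):
    """Approximate how long async_status keeps polling.

    retries mirrors the task expression `lxc_cache_prep_timeout | int // 10`;
    the window is retries * delay (time spent in each status check is ignored).
    """
    retries = int(timeout_seconds) // step
    return retries * delay_seconds


# Assumed timeout of 1200 s, which gives the 120 retries seen in the log.
print(polling_window_seconds(1200, 10))  # 1200 (original delay)
print(polling_window_seconds(1200, 20))  # 2400 (delay raised to 20)
```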
