Traversing a web directory with Python and downloading the files that match a condition

The title is a bit of an exaggeration. My actual goal is to use Python to download all the src.rpm source packages of CentOS 7.6.1810 from a given web page. I looked through the CentOS mirrors and, without exception, none of them has a source package directory, which feels rather unfriendly. The source packages on the page are also not collected under a single directory, and downloading that many packages by hand is unrealistic. openEuler, by contrast, at least provides one source package mirror address: https://repo.openeuler.org/openEuler-20.03-LTS-SP1/ISO/source/.

That is the background. This topic may not be of general interest (I should probably tag it as "for my own use" ^_^), but it is a problem I ran into recently, so I wrote it up.

The starting URL I have is http://mirror.nsc.liu.se/centos-store/7.6.1810/. Opened in a browser, it shows a directory listing (the word lighttpd appears in the lower left corner, so I assume it works on the same principle as an nginx directory index listing, which lets users browse and download files).

It also seems possible to browse it with the command lftp http://mirror.nsc.liu.se/centos-store/7.6.1810/ (embarrassingly, I rarely use this command; I will look into it later, since it may turn out to be surprisingly useful).

[root@localhost ttt]# lftp http://mirror.nsc.liu.se/centos-store/7.6.1810/
cd ok, cwd=/centos-store/7.6.1810                                     
lftp mirror.nsc.liu.se:/centos-store/7.6.1810> ls
drwxr-xr-x  --  ..                   
drwxr-xr-x               2018-11-29 00:58  atomic
drwxr-xr-x               2018-11-29 16:54  centosplus
drwxr-xr-x               2018-11-28 23:59  cloud
drwxr-xr-x               2018-11-29 00:59  configmanagement
drwxr-xr-x               2018-12-02 15:34  cr
drwxr-xr-x               2017-09-29 14:33  dotnet
drwxr-xr-x               2018-11-29 16:55  extras
drwxr-xr-x               2017-09-01 13:08  fasttrack
drwxr-xr-x               2018-11-27 09:05  isos
drwxr-xr-x               2018-11-29 00:59  nfv
drwxr-xr-x               2018-11-29 00:59  opstools
drwxr-xr-x               2018-12-10 22:51  os
drwxr-xr-x               2018-11-29 00:58  paas
drwxr-xr-x               2017-02-10 22:18  rt
drwxr-xr-x               2018-11-29 00:56  sclo
drwxr-xr-x               2018-11-29 00:58  storage
drwxr-xr-x               2018-11-29 16:57  updates
drwxr-xr-x               2018-11-29 00:58  virt

So, starting from this directory, I recursively search for and download the *.src.rpm source package files I need (the approach is not very efficient, I admit).

import os
import re

import requests


def load_url_data(url):
    """
    Extract and download the src.rpm source packages from the page at url
    """
    r = requests.get(url)
    raw_list = re.compile(r'<a.*?>(.*?)</a>').finditer(r.text.strip())
    for i in raw_list:
        x = i.group(1)
        if x.endswith('.src.rpm'):
            # src_rpm = os.path.join(url, x)
            # os.path.join is not used because the path it builds on Windows is wrong
            src_rpm = '/'.join([url, x])
            print(src_rpm)
            if not os.path.exists(x):
                os.system('wget %s' % src_rpm)
            else:
                print('already downloaded %s' % x)
        elif '.' in x or 'x86_64' in x:
            # I do not care about any file other than .src.rpm, so skip them
            # The x86_64 directory mainly holds binary packages, which I do not need, so skip it too
            pass
        else:
            sub_url = '/'.join([url, x])
            print(f'scanning {sub_url} ...')
            load_url_data(sub_url)


if __name__ == '__main__':
    # centos_url = 'https://vault.centos.org/7.6.1810/'
    centos_url = 'http://mirror.nsc.liu.se/centos-store/7.6.1810/'
    load_url_data(centos_url)
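
A side note on the script above: it matches the link text with a regular expression and joins URLs with '/'.join, so a start URL that already ends in a slash produces addresses such as .../7.6.1810//atomic. The following is only a sketch of an alternative, not what I actually ran: it reads the href attributes with the standard library's html.parser and builds child URLs with urllib.parse.urljoin. The names LinkCollector and split_listing, the host mirror.example.org and the sample listing are all made up for illustration, and the code assumes that directory links in the listing end with a slash, which I have not verified for this mirror.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_listing(base_url, html):
    """Return (source package URLs, subdirectory URLs) found in a listing page.

    base_url is expected to end with "/" so that urljoin appends to it.
    """
    collector = LinkCollector()
    collector.feed(html)
    packages, subdirs = [], []
    for href in collector.hrefs:
        # Skip sort links, absolute links and the parent directory.
        if href.startswith(("?", "/", "../")) or "://" in href:
            continue
        if href.endswith(".src.rpm"):
            packages.append(urljoin(base_url, href))
        elif href.endswith("/") and "x86_64" not in href:
            # Assumption: directory links end with "/"; binary trees are skipped.
            subdirs.append(urljoin(base_url, href))
    return packages, subdirs


# Hypothetical listing fragment, made up for illustration.
sample = """
<a href="../">..</a>/
<a href="Source/">Source</a>/
<a href="x86_64/">x86_64</a>/
<a href="demo-1.0-1.el7.src.rpm">demo-1.0-1.el7.src.rpm</a>
<a href="readme.txt">readme.txt</a>
"""
packages, subdirs = split_listing("http://mirror.example.org/centos/7/", sample)
print(packages)
print(subdirs)
```

Because urljoin resolves each href against the page URL, the result contains a single slash between path segments regardless of how the listing writes its links, as long as the page URL itself ends with a slash.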

The output looks roughly like this (an individual download can easily stall, which may be related to network speed):

[root@localhost centos7.1810_src_packages]# python3 test.py 
scanning http://mirror.nsc.liu.se/centos-store/7.6.1810//atomic ...
scanning http://mirror.nsc.liu.se/centos-store/7.6.1810//atomic/Source ...
scanning http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus ...
scanning http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus/Source ...
scanning http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus/Source/SPackages ...
http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus/Source/SPackages/kernel-plus-3.10.0-957.1.3.el7.centos.plus.src.rpm
--2021-03-30 16:52:16--  http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus/Source/SPackages/kernel-plus-3.10.0-957.1.3.el7.centos.plus.src.rpm
Resolving mirror.nsc.liu.se (mirror.nsc.liu.se)... 130.236.101.92, 2001:6b0:17:2::1:92
Connecting to mirror.nsc.liu.se (mirror.nsc.liu.se)|130.236.101.92|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100898069 (96M) [application/x-rpm]
Saving to: ‘kernel-plus-3.10.0-957.1.3.el7.centos.plus.src.rpm’

kernel-plus-3.10.0-957.1.3.el7.centos.pl 100%[==================================================================================>]  96.22M  5.23MB/s    in 21s     

2021-03-30 16:52:37 (4.66 MB/s) - ‘kernel-plus-3.10.0-957.1.3.el7.centos.plus.src.rpm’ saved [100898069/100898069]

http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus/Source/SPackages/kernel-plus-3.10.0-957.10.1.el7.centos.plus.src.rpm
--2021-03-30 16:52:37--  http://mirror.nsc.liu.se/centos-store/7.6.1810//centosplus/Source/SPackages/kernel-plus-3.10.0-957.10.1.el7.centos.plus.src.rpm
Resolving mirror.nsc.liu.se (mirror.nsc.liu.se)... 130.236.101.92, 2001:6b0:17:2::1:92
Connecting to mirror.nsc.liu.se (mirror.nsc.liu.se)|130.236.101.92|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100922887 (96M) [application/x-rpm]
Saving to: ‘kernel-plus-3.10.0-957.10.1.el7.centos.plus.src.rpm’

kernel-plus-3.10.0-957.10.1.el7.centos.p 100%[==================================================================================>]  96.25M  8.52MB/s    in 34s     

2021-03-30 16:53:12 (2.82 MB/s) - ‘kernel-plus-3.10.0-957.10.1.el7.centos.plus.src.rpm’ saved [100922887/100922887]
...
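
About the stalled downloads: the script only checks whether a file with the same name already exists, so a download that was interrupted halfway would be treated as finished on the next run. Here is a minimal sketch of one way around that, using only the standard library instead of wget. It is an illustration under my own assumptions rather than the method used above: the function name download_file, the .part suffix and the 30 second timeout are all hypothetical choices.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit


def download_file(url, dest_dir=".", timeout=30):
    """Download url into dest_dir and return the local path.

    The body is written to "<name>.part" first and renamed only after it
    has been read completely, so an interrupted run never leaves a file
    that looks finished.
    """
    name = os.path.basename(urlsplit(url).path)
    target = os.path.join(dest_dir, name)
    if os.path.exists(target):
        return target  # already downloaded completely, nothing to do

    partial = target + ".part"
    # The timeout applies to blocking operations such as the connection
    # attempt, so a dead connection raises an error instead of hanging.
    with urllib.request.urlopen(url, timeout=timeout) as resp, \
            open(partial, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks, not all in memory
    os.replace(partial, target)
    return target
```

A caller could wrap download_file in a try/except block and retry a few times when an error is raised; a leftover .part file is simply overwritten on the next attempt.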

Part of the download results (the download had not finished yet):

[root@localhost centos7.1810_src_packages]# ll
total 998M
-rw-r--r--. 1 root root 4.1M Sep  1  2017 ansible-2.3.0.0-3.el7.src.rpm
-rw-r--r--. 1 root root 274K Feb 23  2017 apiextractor-0.10.10-11.el7.src.rpm
-rw-r--r--. 1 root root 6.6M Feb 23  2017 babel-2.3.4-1.el7.src.rpm
-rw-r--r--. 1 root root 764K Feb 23  2017 bakefile-0.2.9-2.el7.src.rpm
-rw-r--r--. 1 root root  72K Feb 23  2017 bandit-0.13.2-1.el7.src.rpm
-rw-r--r--. 1 root root 615K Sep  1  2017 blosc-1.11.1-3.el7.src.rpm
-rw-r--r--. 1 root root  68M Feb 23  2017 boost159-1.59.0-2.el7.src.rpm
-rw-r--r--. 1 root root 1.4M Feb 23  2017 coin-or-Cbc-2.9.8-1.el7.src.rpm
-rw-r--r--. 1 root root 953K Feb 23  2017 coin-or-Cgl-0.59.9-1.el7.src.rpm
-rw-r--r--. 1 root root 1.9M Feb 23  2017 coin-or-Clp-1.16.10-1.el7.src.rpm
-rw-r--r--. 1 root root 965K Feb 23  2017 coin-or-CoinUtils-2.10.13-1.el7.src.rpm
-rw-r--r--. 1 root root 736K Feb 23  2017 coin-or-Osi-0.107.8-1.el7.src.rpm
-rw-r--r--. 1 root root 350K Feb 23  2017 coin-or-Sample-1.2.10-5.el7.src.rpm
-rw-r--r--. 1 root root 476K Feb 23  2017 conntrack-tools-1.4.2-3.el7.src.rpm
...

 


Origin blog.csdn.net/TomorrowAndTuture/article/details/115330273