A Few Problems I Ran Into While Scraping

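A couple of days ago a friend sent me a link to a film that is currently in cinemas. The picture looked very sharp, and after inspecting the page I found it was easy to scrape with nothing more than the plain requests library. The simplest possible version is: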

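import os

import requests

url = "URL"              # placeholder: address of the file to download
root = "storage path"    # placeholder: directory to save into
# os.path.join adds the path separator that plain string concatenation would drop
path = os.path.join(root, url.split('/')[-1])
try:
    os.makedirs(root, exist_ok=True)
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # the with block closes the file, so no explicit close() is needed
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError) as e:
    print("Download failed:", e)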

But once the download finished, there were no fewer than 1570 video files.

Luckily I had hit a similar problem before and knew a fix: concatenate the files with a Windows command.

copy /b a.mp4+b.mp4 c.mp4

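I joined the files two at a time with this command (copy /b does accept more than two sources separated by +, but 1570 file names will not fit on a single cmd.exe command line), so I had to chain statements like these: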

......

copy /b 3.ts+6NVYBh6729003.ts 4.ts

copy /b 4.ts+6NVYBh6729004.ts 5.ts

copy /b 5.ts+6NVYBh6729005.ts 6.ts

copy /b 6.ts+6NVYBh6729006.ts 7.ts

copy /b 7.ts+6NVYBh6729007.ts 8.ts

......

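Statements like these did the concatenation correctly, but partway through I noticed a problem: the disk was full. Every intermediate file stays on disk and each one is one segment longer than the last, so the space this approach needs grows quadratically with the number of segments. Each concatenation statement therefore needs to be followed by one that deletes the previous intermediate file.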
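To put rough numbers on this, here is a small back-of-the-envelope sketch. It assumes every segment has the same size and that the chain starts by joining the first two segments; the function name is made up for illustration. It counts only the intermediate files, in units of one segment, and the original segments take another 1570 units on top in both cases.

```python
def peak_intermediate_usage(n_segments, delete_previous):
    """Peak disk space taken by intermediate files, in units of one segment.

    Assumes equal-sized segments and a chain whose results hold
    2, 3, ..., n_segments segments.
    """
    on_disk = []   # sizes of the intermediate files currently on disk
    peak = 0
    for result_size in range(2, n_segments + 1):
        on_disk.append(result_size)        # copy /b writes the new, longer file
        peak = max(peak, sum(on_disk))
        if delete_previous and len(on_disk) > 1:
            on_disk.pop(0)                 # del removes the file that was just extended
    return peak


print(peak_intermediate_usage(1570, delete_previous=False))  # 1233234
print(peak_intermediate_usage(1570, delete_previous=True))   # 3139
```

Without the delete step the intermediates add up to 2 + 3 + ... + 1570 segment-sizes; with it, at most two intermediates exist at any moment.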

In the end it became

......

copy /b 3.ts+6NVYBh6729003.ts 4.ts
del 3.ts

copy /b 4.ts+6NVYBh6729004.ts 5.ts
del 4.ts

copy /b 5.ts+6NVYBh6729005.ts 6.ts
del 5.ts

copy /b 6.ts+6NVYBh6729006.ts 7.ts
del 6.ts

copy /b 7.ts+6NVYBh6729007.ts 8.ts
del 7.ts

copy /b 8.ts+6NVYBh6729008.ts 9.ts
del 8.ts

.....
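Nobody wants to type 1570 of these pairs by hand, so here is a minimal sketch of how such a listing could be generated. The function name, the sample segment names, and the numbering of the intermediate files (1.ts, 2.ts, ...) are my assumptions; the listing above elides how the chain actually starts.

```python
def build_merge_commands(segments):
    """Return cmd.exe lines that chain `segments` pairwise and delete each
    intermediate file as soon as it has been extended."""
    lines = []
    previous = segments[0]              # the first segment starts the chain
    for step, segment in enumerate(segments[1:], start=1):
        result = f"{step}.ts"           # assumed naming: 1.ts, 2.ts, ...
        lines.append(f"copy /b {previous}+{segment} {result}")
        if step > 1:                    # never delete an original segment
            lines.append(f"del {previous}")
        previous = result
    return lines


# Hypothetical segment names, used only to show the shape of the output.
for line in build_merge_commands(["seg000.ts", "seg001.ts", "seg002.ts"]):
    print(line)
```

The returned lines can be written to a .bat file and run from the folder that holds the segments. File names containing spaces would need quoting, which this sketch does not handle.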

After this change the disk demand was much more modest. This really is a very easy problem to overlook!
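Since the download was already done in Python, another option is to skip the intermediate files entirely and append every segment to a single output file. This is a sketch of that alternative, not what I actually ran; the function name, folder name, and output name are hypothetical, and it assumes the segments are passed in playback order.

```python
import shutil


def concat_segments(segment_paths, output_path):
    """Append the bytes of each segment to output_path, in the order given.

    This is the same byte-level join that copy /b performs, but only one
    output file is ever written.
    """
    with open(output_path, "wb") as out:
        for segment in segment_paths:
            with open(segment, "rb") as src:
                shutil.copyfileobj(src, out)   # streams in chunks, so memory use stays small


# Example call (hypothetical names; sorting by name only gives playback
# order if the segment numbers are zero-padded):
#
#     from pathlib import Path
#     parts = sorted(Path("segments").glob("*.ts"))
#     concat_segments(parts, "movie.ts")
```

With this approach the extra disk space needed is just the size of the finished file.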

Test file link.
