Fixing relative-path errors in a Scrapy spider

Many links on a website are written as relative paths. Passing such a link directly to a request raises an error:

Missing scheme in request url: ../index.html
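
To see why the request is rejected, it can help to look at how the relative link parses. The following is only a minimal sketch using the standard library; `has_scheme` is a hypothetical helper made up for illustration and is not Scrapy's actual validation code.

```python
from urllib.parse import urlparse

def has_scheme(url):
    """Return True if the URL carries a scheme such as http or https."""
    # urlparse leaves .scheme as an empty string for relative references.
    return urlparse(url).scheme != ""

relative_ok = has_scheme("../index.html")
absolute_ok = has_scheme("http://www.asite.com/index.html")
print(relative_ok, absolute_ok)
```

A relative reference has no scheme and no host, so there is nothing to connect to until it is resolved against the URL of the page it came from.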

In Python 3, use urljoin from urllib.parse to resolve the relative path against the URL of the current page:

>>> from urllib.parse import urljoin
>>> urljoin("http://www.asite.com/folder/currentpage.html", "anotherpage.html")
'http://www.asite.com/folder/anotherpage.html'
>>> urljoin("http://www.asite.com/folder/currentpage.html", "folder2/anotherpage.html")
'http://www.asite.com/folder/folder2/anotherpage.html'
>>> urljoin("http://www.asite.com/folder/currentpage.html", "/folder3/anotherpage.html")
'http://www.asite.com/folder3/anotherpage.html'
>>> urljoin("http://www.asite.com/folder/currentpage.html", "../finalpage.html")
'http://www.asite.com/finalpage.html'
urljoin automatically combines the current page URL with the relative path to produce an absolute URL.
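
In a spider, the same call would typically be applied to every link extracted from a page. Below is a minimal sketch under that assumption; `absolutize` is a hypothetical helper name, and the link list is made up from the example URLs above.

```python
from urllib.parse import urljoin

def absolutize(page_url, hrefs):
    """Resolve each extracted href against the URL of the page it was found on."""
    # Links that are already absolute pass through urljoin unchanged.
    return [urljoin(page_url, href) for href in hrefs]

page_url = "http://www.asite.com/folder/currentpage.html"
hrefs = [
    "../index.html",
    "anotherpage.html",
    "http://www.asite.com/folder3/anotherpage.html",
]
absolute_links = absolutize(page_url, hrefs)
for link in absolute_links:
    print(link)
```

Inside a Scrapy callback, `response.urljoin(href)` should give the same kind of result, resolving the link against the URL of the response, so importing urljoin separately is optional there.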









Reposted from blog.csdn.net/sigmeta/article/details/80940488