[Crawler Tips] Convert a relative URL into an absolute URL in one step

Problem scenario

When crawling sites such as the People's Daily Online Science Channel, the article list page often gives the detail-page URLs as relative paths rather than absolute ones. Requesting such a path directly does not return the detail page; it returns a 404. The crawler therefore has to join the relative path with the URL of the list page to get the full detail-page URL.

Processing method

There are many ways to handle this. Here is one of the simplest: `urljoin` from Python's standard library.

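# Import the URL helpers (urllib ships with Python, so nothing needs installing)
from urllib import parse

# URL of the list page on which the link was found
page_url = 'http://society.people.com.cn/'

# Relative link extracted from the list page
new_url = '/n1/2021/0209/c1008-32026861.html'

# Resolve the relative link against the list page URL
new_full_url = parse.urljoin(page_url, new_url)

print(new_full_url)  # http://society.people.com.cn/n1/2021/0209/c1008-32026861.html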
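
In a real crawler the list page usually yields many links, and not all of them have the same form. The sketch below shows how `urljoin` treats a few common cases. The helper name `absolutize_links`, the `/GB/index.html` list-page path, and the sample hrefs are assumptions made up for illustration; they are not taken from the site.

```python
from urllib.parse import urljoin


def absolutize_links(page_url, hrefs):
    """Resolve each href against page_url (hypothetical helper for illustration)."""
    return [urljoin(page_url, href) for href in hrefs]


# Assumed list-page URL and sample hrefs, for illustration only
page_url = 'http://society.people.com.cn/GB/index.html'
hrefs = [
    '/n1/2021/0209/c1008-32026861.html',  # root-relative: replaces the whole path
    'c1008-1.html',                       # document-relative: replaces the last segment
    '../x.html',                          # parent-relative: goes up one directory
    'http://example.com/a.html',          # already absolute: returned unchanged
]

full_urls = absolutize_links(page_url, hrefs)
for url in full_urls:
    print(url)
```

Because `urljoin` leaves absolute URLs untouched, it should be safe to pass every extracted link through it without first checking which form the link has.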



Origin blog.csdn.net/qq_20288327/article/details/113771985