Problem scenario
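On sites such as the People's Daily Online Science channel, a crawler will often find that the article links on a list page are paths only (for example, starting with `/`) rather than complete URLs. Requesting such a path directly does not return the detail page; the result is a 404. The list page URL and the detail page path therefore need to be joined into a full URL before the detail page is crawled.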
Processing method
There are several ways to handle this; one of the simplest is shown here.
# Load the URL helpers (urllib is part of the Python standard library)
from urllib import parse
page_url = 'http://society.people.com.cn/'
new_url = '/n1/2021/0209/c1008-32026861.html'
new_full_url = parse.urljoin(page_url, new_url)
print(new_full_url)
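The same call also copes with the other link forms a list page may contain. The following is a minimal sketch of that behavior, not a definitive implementation: the `list_page_url` value, the `to_full_url` helper, and the second and third sample links are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical list-page URL, used only for illustration.
list_page_url = 'http://society.people.com.cn/n1/index.html'


def to_full_url(base_url, href):
    """Resolve a link found on base_url into a full URL (hypothetical helper)."""
    return urljoin(base_url, href)


hrefs = [
    '/n1/2021/0209/c1008-32026861.html',  # path starting at the site root
    'c1008-32026861.html',                # path relative to the list page's directory
    'http://scitech.people.com.cn/',      # already a full URL, returned unchanged
]

for href in hrefs:
    print(to_full_url(list_page_url, href))
```

Because `urljoin` leaves full URLs untouched, every extracted link can be passed through it without first checking which form it takes.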