What we crawl today is the free résumé material on the Chinaz webmaster resource site (sc.chinaz.com).
First, a word about the process: send the request and get the response data, parse it to locate each resume's detail page, then parse the detail page again to extract the download address, and finally save the file. If that isn't clear yet, let's write the code and see it in practice.
import requests
import os
from lxml import etree

if __name__ == '__main__':
    if not os.path.exists('./jianlisucai'):
        os.mkdir('./jianlisucai')
    headers = {
        # Must be spelled "User-Agent"; a space instead of the hyphen causes a 400 error.
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    }
    url = 'http://sc.chinaz.com/jianli/free_%d.html'
    # Pagination: the first page has its own URL; later pages follow the free_%d pattern.
    for pageNum in range(1, 5):
        if pageNum == 1:
            new_url = 'http://sc.chinaz.com/jianli/free.html'
        else:
            new_url = url % pageNum
        # Fetch the listing page.
        response = requests.get(url=new_url, headers=headers)
        response.encoding = 'utf-8'
        page_text = response.text
        # Instantiate an etree object for parsing.
        tree = etree.HTML(page_text)
        # Locate the tags holding each resume's detail page (// searches the whole document).
        div_list = tree.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Extract the detail-page URL.
            detail_url = div.xpath('./a/@href')[0]
            # Build the resume file name from the image's alt text.
            page_name = div.xpath('./a/img/@alt')[0] + '.rar'
            # Fetch the detail page.
            detail_response = requests.get(url=detail_url, headers=headers)
            detail_response.encoding = 'utf-8'
            detail_data = detail_response.text
            # Instantiate a parser for the detail page.
            tree2 = etree.HTML(detail_data)
            # Pick download address 1, i.e. li[1].
            download_link = tree2.xpath('//div[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
            # The archive is binary, so read .content (no text decoding needed).
            download_data = requests.get(url=download_link, headers=headers).content
            filepath = './jianlisucai/' + page_name
            with open(filepath, 'wb') as fp:
                fp.write(download_data)
            print(page_name, 'downloaded successfully!')
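Since the script depends on the live site, the parsing pattern it uses can be sanity-checked offline. A minimal sketch using the standard library's xml.etree as a stand-in for lxml, run against hypothetical markup shaped like the listing page (the URLs and alt texts below are made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet mimicking the listing page's container structure.
html = '''
<div id="container">
  <div><a href="http://sc.chinaz.com/jianli/1.html"><img alt="resume-one"/></a></div>
  <div><a href="http://sc.chinaz.com/jianli/2.html"><img alt="resume-two"/></a></div>
</div>
'''

root = ET.fromstring(html)
results = []
# Same idea as tree.xpath('//div[@id="container"]/div') followed by ./a/@href and ./a/img/@alt.
for div in root.findall('./div'):
    a = div.find('./a')
    name = a.find('./img').get('alt') + '.rar'
    results.append((name, a.get('href')))
    print(name, a.get('href'))
```

Each tuple pairs the file name (from the image's alt text) with the detail-page URL, mirroring what the real loop extracts.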
You can verify the results yourself. Below are a few mistakes I ran into while writing this.
1. Request header error
The header name in
headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36', }
must be written as User-Agent. Replacing the hyphen with a space produces a request header error, namely a 400 response: "The request has an invalid header name".
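To see why the hyphen matters: HTTP header field names must be "tokens" in the RFC 7230 sense, and spaces are not token characters. A small sketch of that rule (the regex and helper below are my own illustration, not part of requests):

```python
import re

# RFC 7230 "token" characters permitted in an HTTP header field name; note no space.
TOKEN = re.compile(r"^[!#$%&'*+.^_`|~0-9A-Za-z-]+$")

def is_valid_header_name(name):
    """Return True if name is a legal HTTP header field name."""
    return bool(TOKEN.fullmatch(name))

print(is_valid_header_name('User-Agent'))   # True: hyphen is a token character
print(is_valid_header_name('User Agent'))   # False: space is not allowed
```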
2. Paging operation
The URL of the first page must be distinguished from the URLs of the following pages.
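That branching is easy to get wrong inside the loop, so it can be isolated into a small helper (the function name page_url is my own):

```python
def page_url(page_num):
    """Build the listing URL: the first page has its own address,
    later pages follow the free_%d pattern."""
    if page_num == 1:
        return 'http://sc.chinaz.com/jianli/free.html'
    return 'http://sc.chinaz.com/jianli/free_%d.html' % page_num

print(page_url(1))  # http://sc.chinaz.com/jianli/free.html
print(page_url(3))  # http://sc.chinaz.com/jianli/free_3.html
```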
3. Decoding problem
Two ways to fix garbled Chinese characters:
- img_name.encode('iso-8859-1').decode('gbk')
- response.encoding = 'utf-8'
The two approaches are interchangeable; I am used to the second one.
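The first method works because the garbled text is really GBK-encoded bytes that were wrongly decoded as ISO-8859-1; since ISO-8859-1 maps every byte to a character, re-encoding with it recovers the original bytes exactly, which can then be decoded as GBK. A minimal simulation (the sample string is my own):

```python
# Simulate mojibake: GBK bytes wrongly decoded as ISO-8859-1.
original = '个人简历'  # "personal résumé"
garbled = original.encode('gbk').decode('iso-8859-1')

# Method 1: re-encode with the wrong codec, then decode with the right one.
fixed = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(fixed)
```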