Using XPath to crawl free resume templates from the webmaster materials site

What we crawl today are the free resume templates on the webmaster materials site (sc.chinaz.com).
Let me describe the process first. We send the request and get the response data; then we parse that data to extract each resume's detail-page URL; then we parse each detail page again to get the download address; finally we save the file. If that sounds abstract, let's go straight to the code and practice.

import requests
import os
from lxml import etree

if __name__ == '__main__':
    # Create the output directory if it does not exist yet
    if not os.path.exists('./jianlisucai'):
        os.mkdir('./jianlisucai')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    }
    url = 'http://sc.chinaz.com/jianli/free_%d.html'
    # Pagination: the first page uses a different URL from the numbered pages
    for pageNum in range(1, 5):
        if pageNum == 1:
            new_url = 'http://sc.chinaz.com/jianli/free.html'
        else:
            new_url = url % pageNum
        # Fetch the list page
        response = requests.get(url=new_url, headers=headers)
        response.encoding = 'utf-8'  # avoid garbled Chinese in the parsed text
        page_text = response.text
        # Instantiate an etree object from the page source
        tree = etree.HTML(page_text)
        # Locate each resume entry (//* would match globally; scope to #container instead)
        div_list = tree.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Extract the detail-page URL
            detail_url = div.xpath('./a/@href')[0]
            # Name the resume after the thumbnail's alt text
            page_name = div.xpath('./a/img/@alt')[0] + '.rar'
            # Fetch the detail page
            detail_response = requests.get(url=detail_url, headers=headers)
            detail_response.encoding = 'utf-8'
            detail_data = detail_response.text
            tree2 = etree.HTML(detail_data)
            # Pick download address 1, i.e. li[1]
            download_link = tree2.xpath('//div[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
            # Download the archive as raw bytes and save it
            download_data = requests.get(url=download_link, headers=headers).content
            filepath = './jianlisucai/' + page_name
            with open(filepath, 'wb') as fp:
                fp.write(download_data)
            print(page_name, 'downloaded successfully!')
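One note before the pitfalls: the script fires requests back to back. If you want to be gentler on the server, a randomized delay between requests is an easy addition. A minimal sketch, with delay bounds of my own choosing (not from the original script):

import random
import time

def polite_sleep(min_s=0.5, max_s=2.0):
    # Sleep a random interval so requests are not fired in a tight loop.
    # The 0.5-2.0 second bounds are illustrative assumptions.
    time.sleep(random.uniform(min_s, max_s))

# Usage: call polite_sleep() after each requests.get(...) in the loops above.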

You can verify the results yourself. Below I mainly want to go over a few mistakes I ran into while doing this.

  1. Request-header error. The key must be written exactly as User-Agent; if the hyphen is replaced with a space, the request fails with Error 400, "The request has an invalid header name".
  2. Pagination. The first page (free.html) uses a different URL from the numbered pages (free_%d.html), so the two cases must be handled separately, as in the code above.
  3. Decoding problem. There are two ways to fix garbled Chinese text (a short sketch follows this list):
     - img_name.encode('iso-8859-1').decode('gbk')
     - response.encoding = 'utf-8'
     The two methods are interchangeable; I am used to the second one.
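To make the decoding fix concrete, here is a minimal sketch of both methods. The mojibake string is an illustrative example of my own, not taken from the site:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://sc.chinaz.com/jianli/free.html', headers=headers)

# Method 2 (whole response): declare the encoding before reading .text,
# so every string parsed from the page decodes correctly.
response.encoding = 'utf-8'
page_text = response.text

# Method 1 (single string): if one extracted string is already garbled,
# undo requests' ISO-8859-1 guess and decode with the page's real encoding.
garbled = '±íµ¥'  # mojibake that should read 表单
fixed = garbled.encode('iso-8859-1').decode('gbk')
print(fixed)  # 表单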

Original post: blog.csdn.net/qwerty1372431588/article/details/106117458