Python crawler exercise: Form information crawling based on XPath

This is an introductory exercise for Python crawlers. We request pages with the requests library, match table elements and their content with XPath, and organize the data with pandas. Let's go step by step.

Determining the goal and analyzing the approach

Target

The goal is to scrape tabular data from a website. The site I've chosen is http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList (hereinafter called the list page).

  • The page contains a list of personal information, including names, unit names, and so on.
  • The list is paginated; you have to keep clicking to the next page to load more entries.
  • For more details, you have to click a name in the list to open a subpage (hereinafter called the details page). This means looping over the links and doing a second round of crawling.

Approach

The idea: on the list page, use XPath selection plus page turning to collect the details-page link for every person. Then traverse those links, crawl the information, and save it as a CSV file.

Observing the site

Press F12 to open the browser's developer tools and switch to the Network tab.

Refresh the list page and you can see the request stream, including the Request URL, Request Headers, and so on. Because the site requires a password login, we need to send the cookie as part of the headers with every request.

Click to the second page of the list and you will find that turning pages only requires changing the value after page= in the URL: http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=xxx.
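As a minimal sketch (the cookie value is a placeholder), the page number can also be passed through the params argument of requests.get rather than spliced into the URL by hand:

import requests

headers = {"Cookie": "xxxxxxxxxx"}  # placeholder: copy your own cookie from the developer tools
params = {"pinqing_dwdm": "80002", "is_yw": "False", "page": 3}  # change "page" to turn pages
response = requests.get("http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList",
                        headers=headers, params=params)
print(response.url)  # the final URL with the query string attached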

Clear the Network log, click on any name, and open the corresponding details page. Its Request URL is: http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaEdit?zhuanjia_id=xxxxx. The information is requested via GET, and zhuanjia_id is the ID of each person; changing the ID returns a different person's information. It is easy to see that the ID can be found in the href of the name column on the list page.
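For instance, a minimal sketch of pulling the ID out of such an href with the standard library (the href value below is made up for illustration):

from urllib.parse import urlparse, parse_qs

href = "/zh-cn/zhuanjia/ZhuanjiaEdit?zhuanjia_id=12345"  # hypothetical href taken from the name column
zhuanjia_id = int(parse_qs(urlparse(href).query)["zhuanjia_id"][0])
print(zhuanjia_id)  # 12345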

I tried reading the online tables with pandas.read_html, but the basic data-formatting work it requires is troublesome, so I use XPath to locate the elements directly.
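For reference, a rough, self-contained sketch of what the read_html route looks like (the cookie is a placeholder). Note that by default read_html keeps only the cell text, so the hrefs carrying the IDs in the name column are not captured:

import io
import requests
import pandas as pd

headers = {"Cookie": "xxxxxxxxxx"}  # placeholder: copy your own cookie from the developer tools
response = requests.get(
    "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=1",
    headers=headers)
tables = pd.read_html(io.StringIO(response.text))  # one DataFrame per <table> found in the page
print(tables[0].head())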

Crawling the list table

Following the idea above, we proceed in two steps. First, we work on the list page and, by turning the pages, collect the details-page link (and hence the ID) corresponding to every person.

After filling in the headers, fetch the page with requests.get, then use XPath to extract the fields and save them.

The code is as follows:

import requests
from lxml import etree
import pandas as pd

# parameter = {
#     "key1": "value1",
#     "key2": "value2"
# }
headers = {
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "xxxxxxxxxx",
    "Referer": "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=2",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
total_page = 2111
dfs = pd.DataFrame(columns=['name', 'id'])
ind = 0
for i in range(2111, total_page + 1):  # the start page is adjusted manually when resuming a broken run
    response = requests.get(
        "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=" + str(i),
        headers=headers)  # , params=parameter)
    # print(response.url)
    # print(response.text)
    # print(response.content)
    # print(response.encoding)
    # print(response.status_code)
    html = response.text
    tree = etree.HTML(html)
    # the <a> elements of the name column: the text is the name, the href carries the zhuanjia_id
    NAME = tree.xpath('//*[@id="main-content"]/div/div/div/div/div[2]/div/table/tbody/tr/td[1]/a')
    IDHREF = tree.xpath('//*[@id="main-content"]/div/div/div/div/div[2]/div/table/tbody/tr/td[1]/a/@href')
    for j in range(0, len(NAME)):
        ind = ind + 1
        print(ind)
        ids = str(IDHREF[j]).split('=')  # the ID is the part after "zhuanjia_id="
        new = pd.DataFrame({'name': NAME[j].text, 'id': int(ids[1])}, index=[ind])
        dfs = pd.concat([dfs, new])  # DataFrame.append has been removed from recent pandas
dfs.to_csv("./教授名字和ID.csv", encoding="utf_8_sig")

There is nothing special to emphasize here, but this code is not particularly efficient and several places could be optimized, for example doing the extraction with a single tree.xpath call, as sketched below.
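For example, here is a rough sketch of that kind of optimization (reusing the imports, headers, and total_page from the block above, and assuming the same page structure): a single XPath query returns the <a> elements, each of which carries both the name and the ID, and the DataFrame is built once at the end instead of being extended row by row:

rows = []
for i in range(1, total_page + 1):
    response = requests.get(
        "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=" + str(i),
        headers=headers)
    tree = etree.HTML(response.text)
    # one XPath call: each <a> gives the name via .text and the ID via its href
    for a in tree.xpath('//*[@id="main-content"]/div/div/div/div/div[2]/div/table/tbody/tr/td[1]/a'):
        rows.append({'name': a.text, 'id': int(a.get('href').split('=')[1])})
dfs = pd.DataFrame(rows)
dfs.to_csv("./教授名字和ID.csv", encoding="utf_8_sig")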

Crawling the secondary information on the details pages

Assuming we have obtained the IDs needed for the details pages, we can now collect every details page by repeatedly sending requests to URLs that differ only in the ID. Then use the developer tools (right-click, Copy XPath) to get the XPath pattern for each field, paste it into the code, extract the information, organize it with pandas, and save it. The code is as follows:

import requests
from lxml import etree
import pandas as pd
ID1 = pd.read_csv('./教授名字和ID0.csv')  # name/ID lists saved by the list-crawling step
ID2 = pd.read_csv('./教授名字和ID.csv')
ID1n = list(map(int, ID1['id'])) 
ID2n = list(map(int, ID2['id'])) 
IDs = ID1n+ID2n
IDs.sort()

header = {
    "Connection": "keep-alive",
    "Cookie": "xxxx",
    "Host": "py.ucas.ac.cn",
    "Referer": "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=1936",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
bg = 20617  # index in IDs to start from (adjusted manually when resuming a broken run)
ed = len(IDs)
#ed = 2
data = pd.DataFrame(columns=['姓名','性别','单位','通讯地址','邮编','邮箱','手机',\
                             '固话','出生日期','身份证号','学科专业','方向','导师类别',\
                                 '专业技术职务','银行户头','银行地址','银行号码'])
count = -1
for i in range(bg,ed):
    count = count+1
    id = IDs[i]
    print(str(IDs[bg])+"TO"+str(IDs[ed-1])+":"+str(i)+":"+str(id)+" in progress...")
    response = requests.get("http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaEdit?zhuanjia_id="+str(id),headers=header)
    html=response.text
    # print(html)
    # with open('test.html','w',encoding='utf-8') as f:
    #     f.write(html)
    tree=etree.HTML(html)
    zjxm = str(tree.xpath('//*[@id="ZJXM"]/@value')[0])  # name (姓名)
    zjxb = str(tree.xpath('//*[@id="ZJXB"]/option[@selected="selected"]/@value')[0])  # gender (性别)
    jsdw = str(tree.xpath('//*[@id="DWDM"]/option[@selected="selected"]')[0].text)  # unit (单位)
    zjtxdz = str(tree.xpath('//*[@id="ZJTXDZ"]/@value')[0])  # mailing address (通讯地址)
    zjyzbm = str(tree.xpath('//*[@id="ZJYZBM"]/@value')[0])  # postal code (邮编)
    UserName = str(tree.xpath('//*[@id="UserName"]/@value')[0])  # email (邮箱)
    zjsj = str(tree.xpath('//*[@id="ZJSJ"]/@value')[0])  # mobile phone (手机)
    zjtel = str(tree.xpath('//*[@id="ZJTEL"]/@value')[0])  # landline (固话)
    zjcsrq = str(tree.xpath('//*[@id="ZJCSRQ"]/@value')[0])  # date of birth (出生日期)
    zjhm = str(tree.xpath('//*[@id="zjhm"]/@value')[0])  # ID card number (身份证号)
    tmp = tree.xpath('//*[@id="ZJ_XKZY"]/option[@selected="selected"]')
    if len(tmp) == 0:
        zjxkzy = ''
    else:
        zjxkzy = str(tmp[0].text)  # discipline/major (学科专业)
    yjfx = str(tree.xpath('//*[@id="yjfx"]/@value')[0])  # research direction (方向)
    zjlb = str(tree.xpath('//*[@id="ZJLB"]/option[@selected="selected"]')[0].text)  # supervisor category (导师类别)
    tmp = tree.xpath('//*[@id="ZJZWCODE"]/option[@selected="selected"]')
    if len(tmp) == 0:
        zjzwcode = ''
    else:
        zjzwcode = str(tmp[0].text)  # professional/technical title (专业技术职务)
    bank_hutou = str(tree.xpath('//*[@id="bank_hutou"]/@value')[0])  # bank account holder (银行户头)
    bank_dizhi = str(tree.xpath('//*[@id="bank_dizhi"]/@value')[0])  # bank address (银行地址)
    bank_xingming = str(tree.xpath('//*[@id="bank_xingming"]/@value')[0])  # bank account number (银行号码)
    data.loc[count] = [zjxm,zjxb,jsdw,zjtxdz,zjyzbm,UserName,zjsj,zjtel,zjcsrq,zjhm,zjxkzy,yjfx,zjlb,zjzwcode,bank_hutou,bank_dizhi,bank_xingming]
data.to_csv("./"+str(bg)+"_"+str(IDs[bg])+"TO"+str(i)+"_"+str(IDs[i])+"高校教授联系方式.csv", encoding="utf_8_sig")

When crawling for a long time there will always be some problems, such as the network refusing connections. My code contains no try/except exception handling: when an exception occurred, the run simply broke, I saved the intermediate results by hand, and then restarted the code manually from the breakpoint. I enjoy this manual-transmission process without any exception-handling mechanism, but a sketch of a retry-based alternative is given below.
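For readers who prefer automation, here is a minimal, hedged sketch of what a retry-and-checkpoint variant of the loop above might look like (it reuses IDs, bg, ed, header, and data from the code above; the retry count, sleep time, and checkpoint interval are arbitrary):

import time

for i in range(bg, ed):
    id = IDs[i]
    html = None
    for attempt in range(3):  # retry up to 3 times on network problems
        try:
            response = requests.get(
                "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaEdit?zhuanjia_id=" + str(id),
                headers=header, timeout=10)
            response.raise_for_status()
            html = response.text
            break
        except requests.RequestException:
            time.sleep(5)  # wait a moment before retrying
    if html is None:
        print(str(id) + " failed after 3 attempts, skipping")
        continue
    # ... XPath extraction and data.loc[...] = [...] exactly as in the loop above ...
    if i % 100 == 0:
        data.to_csv("./checkpoint_" + str(i) + ".csv", encoding="utf_8_sig")  # periodic intermediate save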

Crawlers must comply with relevant laws and regulations; do not do anything illegal

As the joke goes, write your crawler well enough and you will eat your fill of prison food. Do not crawl private data, do not bring the website down, and do not use crawling for any kind of profit.

Summary of crawler tips

  1. The request header information can be copied directly from the browser developer tools' network log and then modified as needed.

  2. When matching information, it is recommended to save the webpage returned by the Python request and do the element matching against that, rather than opening the request URL directly in the browser and matching against what the browser shows, because the result Python gets back is sometimes different from what the browser gets. The code for saving the webpage returned by Python is as follows:

    response = requests.get("http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaEdit?zhuanjia_id="+str(id),headers=header)
    html=response.text
    print(html)
    with open('test.html','w',encoding='utf-8') as f:
        f.write(html)
    
  3. Right-click an element, choose Inspect, then Copy XPath to get the matching pattern and paste it into the code. When the match result is empty or wrong, copy the XPath of a sibling element of the one you want and compare the two patterns to see where they differ.

  4. Get good at XPath's @ syntax. When we want the value of an attribute inside a tag, @ does the job. For example, in tree.xpath('//*[@id="DWDM"]/option[@selected="selected"]')[0].text the square brackets are an index, and option[@selected="selected"] selects, among all the options (generated by a drop-down box), the one that meets the condition. Another example is /@value, as in tree.xpath('//*[@id="ZJXM"]/@value')[0], where /@value takes the value attribute of every element matching //*[@id="ZJXM"] and returns them as a list.

  5. Organize a requests + XPath template like the one below; whenever you want to crawl something, just copy and modify it:

    import requests
    from lxml import etree
    import pandas as pd
    # parameter = {
    #     "key1": "value1",
    #     "key2": "value2"
    # }
    header = {
        "Connection": "keep-alive",
        "Cookie": "xxx",
        "Host": "py.ucas.ac.cn",
        "Referer": "http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaList?pinqing_dwdm=80002&is_yw=False&page=1936",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
    }
    data = pd.DataFrame(columns=['姓名','单位'])
    response = requests.get("http://py.ucas.ac.cn/zh-cn/zhuanjia/ZhuanjiaEdit?zhuanjia_id=xxx",headers=header)#,params = parameter)
    html=response.text
    print(html)
    # print(response.url)
    # print(response.text)
    # print(response.content)
    # print(response.encoding)
    # print(response.status_code)
    with open('test.html','w',encoding='utf-8') as f:
        f.write(html)
    tree=etree.HTML(html)
    zjxm = str(tree.xpath('//*[@id="ZJXM"]/@value')[0])  # name (姓名)
    jsdw = str(tree.xpath('//*[@id="DWDM"]/option[@selected="selected"]')[0].text)  # unit (单位)
    data.loc[0] = [zjxm,jsdw]
    data.to_csv("xxx.csv", encoding="utf_8_sig")
    
