Python crawler: scraping multiple pages of data

Suppose we have a task: crawl all of the course data from http://www.chinaooc.cn/front/show_index.htm.

However, the conventional crawling approach does not work here, because the data is paginated.

The key observation is that no matter which page you navigate to, the URL in the browser's address bar never changes, so a naive crawler only ever fetches the first page of data. Press F12 and inspect the page source: the data turns out to be loaded dynamically via JavaScript, and there is no per-page URL, only a skipToPage(..) function.

So the solution is:

  1. Capture the request information, including the headers and the form data
  2. Simulate the request to obtain the data
  3. Parse the returned data to extract the results

The implementation steps are as follows:

1. Capture the request information. In the browser console (opened with F12), select Network -> XHR, then click the page-jump button on the page; the console will record the request that is issued. Select that request, then select Headers to view the request header information.

2. Simulate the request with Python. In the Headers panel, find the Request Headers section; these are the headers sent with the data request.

Then find the Form Data section.

Copying the contents of both gives the following code:

headers = {
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,ko;q=0.7',
    'Connection': 'keep-alive',
    'Content-Length': '61',
    'Cookie': 'route=bd118df546101f9fcee5c1a58356a008; JSESSIONID=047BD79E9754BAED525EFE860760393E',
    'Host': 'www.chinaooc.cn',
    'Origin': 'http://www.chinaooc.cn',
    'Pragma': 'no-cache',
    'Referer': 'http://www.chinaooc.cn/front/show_index.htm',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}

form_data = {
    'pager.pageNumber': '2',
    'pager.pageSize': '50',
    'pager.keyword': '',
    'mode': 'page'
}
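
Note: the Cookie value is tied to one browser session and will eventually expire, and requests computes Content-Length automatically, so both of those headers can usually be omitted when replaying the request.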

Simulate sending the request. Changing 'pager.pageNumber' in form_data each time yields a different page of data:

form_data['pager.pageNumber'] = times
url = 'http://www.chinaooc.cn/front/show_index.htm'
response = requests.post(url, data=form_data, headers=headers)
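
As a quick sanity check before parsing, a sketch along these lines (reusing the headers and form_data defined above; the page range here is a hypothetical test value) confirms that each page responds successfully:

import requests

# Fetch a few pages and confirm each request succeeds.
# Assumes the `headers` and `form_data` dictionaries defined above.
url = 'http://www.chinaooc.cn/front/show_index.htm'
for page in range(1, 4):  # hypothetical test range of pages
    form_data['pager.pageNumber'] = str(page)
    response = requests.post(url, data=form_data, headers=headers)
    print(page, response.status_code)  # expect 200 for every page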

3. Parse the information returned in the response to extract the data.
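
The response body is an HTML table in which each tr row is one course. A minimal sketch of the parsing step (the column positions are taken from the complete program below):

from bs4 import BeautifulSoup

# Assumes `response` is the result of the POST request above.
soup = BeautifulSoup(response.content, "html.parser")
rows = soup.find_all('tr')[1:-1]  # skip the table's header and footer rows
for tr in rows:
    cells = tr.find_all('td')
    num = cells[0].contents[0]     # course number
    school = cells[1].contents[0]  # school name
    clazz = cells[2].contents[0]   # course title
    course_url = cells[5].find_all('a')[0]['href']  # link to the course page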

The complete code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

class item:
    def __init__(self):
        self.num=0
        self.school=''
        self.clazz=''
        self.url=''
        
headers = {
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,ko;q=0.7',
    'Connection': 'keep-alive',
    'Content-Length': '61',
    'Cookie': 'route=bd118df546101f9fcee5c1a58356a008; JSESSIONID=047BD79E9754BAED525EFE860760393E',
    'Host': 'www.chinaooc.cn',
    'Origin': 'http://www.chinaooc.cn',
    'Pragma': 'no-cache',
    'Referer': 'http://www.chinaooc.cn/front/show_index.htm',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}

form_data = {
    'pager.pageNumber': '2',
    'pager.pageSize': '50',
    'pager.keyword': '',
    'mode': 'page'
}

times = 20
while times < 34:
    form_data['pager.pageNumber'] = times
    url = 'http://www.chinaooc.cn/front/show_index.htm'
    response = requests.post(url, data=form_data, headers=headers)

    soup = BeautifulSoup(response.content, "html.parser")
    tr_list = soup.find_all('tr')
    my_tr_list = tr_list[1:-1]  # drop the table's header and footer rows

    for tr in my_tr_list:
        td_list = tr.find_all('td')
        a = item()
        a.num = td_list[0].contents[0]
        a.school = td_list[1].contents[0]
        a.clazz = td_list[2].contents[0].replace('\"', ' ')
        a.url = td_list[5].find_all('a')[0]["href"]
        # download each course page and save it to a local file
        with open('E:/data/' + '[' + a.num + '][' + a.school + '][' + a.clazz + '].html', 'wb') as f:
            res = requests.get(a.url)
            res.encoding = res.apparent_encoding
            f.write(res.content)

    times = times + 1
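
One possible refinement, not part of the original post: drop the hardcoded Cookie header and let requests.Session() manage cookies automatically, and pause briefly between pages to be polite to the server. A minimal sketch under those assumptions:

import time
import requests

session = requests.Session()  # persists cookies across requests automatically
url = 'http://www.chinaooc.cn/front/show_index.htm'

for page in range(20, 34):
    form_data['pager.pageNumber'] = str(page)
    response = session.post(url, data=form_data, headers=headers)
    response.raise_for_status()  # stop early on an HTTP error
    # ... parse the response as in the complete program above ...
    time.sleep(1)  # small delay between page requests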

