[Network Security Takes You to Practice Crawlers-100 Practices] Practice 5: Page Turning Operation of Crawlers + Error Filtering

Table of contents

1. Page turning analysis
2. Code logic
    1. Modify the entry program
    2. The page parameter is passed in
3. Complete code
    1. Running results
    2. Error analysis
    3. Defect code
    4. Improve the logic
    5. Improve the code


(As mentioned earlier, any piece of logic can be implemented in many ways; let's start with the simplest, most naive one.)

(Note: The cookie needs to be filled in by yourself)

1. Page turning analysis:

Compare the URL of the first page with the URL of the second page and look for the difference.

It turns out that the page is controlled by the parameter pageNum=

(The pageNum parameter is omitted on the first page, and if the next page has no data, errors may also occur)
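A quick sketch of that comparison (the query string is copied from the URL used later in the article; the session and cache values are just whatever the browser happened to send):

base = ('https://www.tianyancha.com/search'
        '?key=&sessionNo=1688538554.71584711&base=hub'
        '&cacheCode=00420100V2020&city=wuhan')

page1_url = base                    # pageNum is omitted on the first page
page2_url = base + '&pageNum=2'     # later pages simply append pageNum=N
print(page1_url)
print(page2_url)                    # the only difference is the pageNum parameter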



2. Code logic

1. Modify the entry program

if __name__ == '__main__':
    with open('1.csv', 'a', encoding='utf-8', newline='') as f:
        csv_w = csv.writer(f)
        csv_w.writerow(('公司名', 'URL', '类型', '资金'))
        for page in range(1, 6):
            get_TYC_info(page)
            print(f'第{page}页已爬完')
            time.sleep(2)

(1) if __name__ == '__main__':
A conditional that checks whether the current module is being run directly. Only when the module is executed directly (not imported) does the code block below run.
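A minimal, self-contained illustration (the file name demo.py is made up for this example):

# demo.py
def main():
    print('running as a script')

if __name__ == '__main__':
    # runs only for `python demo.py`; skipped when another file does `import demo`
    main()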


(2) with open('1.csv', 'a', encoding='utf-8', newline='') as f:
Opens the file named 1.csv and binds it to the variable f. Mode 'a' appends to the file, encoding='utf-8' opens it with UTF-8 encoding, and newline='' keeps the csv module from inserting extra blank lines between rows when writing.


(3) csv_w = csv.writer(f)
Creates a CSV writer object around the file object f; the CSV file is then written through this writer.


(4) csv_w.writerow(('公司名', 'URL', '类型', '资金'))
Writes a tuple of four elements (company name, URL, type, capital) to the CSV file through the writer object csv_w. This tuple is the header, i.e. the first line of the file.
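Putting (2), (3) and (4) together, a minimal standalone sketch (the file name test.csv and the data row are placeholders, not output from the crawler):

import csv

with open('test.csv', 'a', encoding='utf-8', newline='') as f:   # append mode, UTF-8
    csv_w = csv.writer(f)                                        # writer bound to f
    csv_w.writerow(('公司名', 'URL', '类型', '资金'))                 # header row
    csv_w.writerow(('示例公司', 'https://example.com', '小微企业', '100万'))  # placeholder data row

Because the file is opened in 'a' mode, running this twice appends a second header row; 'w' mode would give a fresh file on each run.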


(5) for page in range(1, 6):
A loop that iterates from 1 to 5, assigning the current value to the variable page on each pass.


(6) get_TYC_info(page)
Calls the function get_TYC_info, passing the current page number as the argument. This function crawls the information from the TYC (Tianyancha) site.


(7) print(f'第{page}页已爬完')
Prints a message that page {page} has finished crawling; a simple progress indicator for the program.


(8) time.sleep(2)
Pauses the program for 2 seconds, so that pages are not requested so quickly that access gets blocked or throttled.

2. The page parameter is passed in

def get_TYC_info(page):
    TYC_url = f"https://www.tianyancha.com/search?key=&sessionNo=1688538554.71584711&base=hub&cacheCode=00420100V2020&city=wuhan&pageNum={page}"

1. The page parameter is passed into the get_TYC_info() function (the page-crawling function)

2. f'URL......&pageNum={page}'

The f-string dynamically substitutes the page number into the pageNum parameter of the URL



3. Complete code

(code at the end)

1. Running results

(Pages 1 and 2 can be crawled)

An error appears on page 2

(Let's analyze this error)

In fact, the error is caused by the crawled list being empty


2. Error analysis:

Look at the error screenshot and work out the cause:

It means that, for the company where the error occurs, there are no related type tags.

So the fetched list is empty, and the next level cannot be parsed ----> hence the error
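A tiny standalone reproduction of this failure mode (the HTML snippet is made up; only the second item has the tag-list div):

from bs4 import BeautifulSoup

html = '''
<div class="item"><div class="name">A公司</div></div>
<div class="item">
  <div class="name">B公司</div>
  <div class="index_tag-list__wePh_">
    <div class="index_tag-common__edIee">小微企业</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for item in soup.find_all('div', attrs={'class': 'item'}):
    tag_div = item.find('div', attrs={'class': 'index_tag-list__wePh_'})
    print(tag_div)   # None for the first item
    # calling tag_div.find_all(...) on that None is exactly what raises
    # AttributeError: 'NoneType' object has no attribute 'find_all'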


3. Defect code:

import time
import requests
from bs4 import BeautifulSoup
import csv

def get_TYC_info(page):
    TYC_url = f"https://www.tianyancha.com/search?key=&base=hub&city=wuhan&cacheCode=00420100V2020&sessionNo=1688108233.45545222&pageNum={page}"
    html = get_page(TYC_url)
    soup = BeautifulSoup(html, 'lxml')
    GS_list = soup.find('div', attrs={'class': 'index_list-wrap___axcs'})
    GS_items = GS_list.find_all('div', attrs={'class': 'index_search-box__7YVh6'})
    for item in GS_items:
        title = item.find('div', attrs={'class': 'index_name__qEdWi'}).a.span.text
        link = item.a['href']
        company_type = item.find('div', attrs={'class': 'index_tag-list__wePh_'}).find_all('div', attrs={'class': 'index_tag-common__edIee'})
        type_texts = [element.text for element in company_type]
        money = item.find('div', attrs={'class': 'index_info-col__UVcZb index_narrow__QeZfV'}).span.text

        print(title.strip(), link, type_texts, money)


def get_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36',
            'Cookie': '!!!!!!!!!!'
        }
        response = requests.get(url, headers=headers, timeout=10)
        return response.text
    except:
        return ""


if __name__ == '__main__':
    with open('1.csv', 'a', encoding='utf-8', newline='') as f:
        csv_w = csv.writer(f)
        csv_w.writerow(('公司名', 'URL', '类型', '资金'))
        for page in range(1, 6):
            get_TYC_info(page)
            print(f'第{page}页已爬完')
            time.sleep(2)

4. Improve the logic:

Add an if check: continue only when the element fetched in the first step is not None.

        if company_type_div is not None:
            company_type = company_type_div.find_all('div', attrs={'class': 'index_tag-common__edIee'})
            type_texts = [element.text for element in company_type]
        else:
            type_texts = ''

Running result:

All 5 specified pages are crawled successfully
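To see why the guard prevents the crash, here is the same made-up HTML from the error analysis run through the guarded logic; the missing tag-list simply becomes an empty string instead of raising an error:

from bs4 import BeautifulSoup

html = '''
<div class="item"><div class="name">A公司</div></div>
<div class="item">
  <div class="name">B公司</div>
  <div class="index_tag-list__wePh_">
    <div class="index_tag-common__edIee">小微企业</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for item in soup.find_all('div', attrs={'class': 'item'}):
    company_type_div = item.find('div', attrs={'class': 'index_tag-list__wePh_'})
    if company_type_div is not None:
        company_type = company_type_div.find_all('div', attrs={'class': 'index_tag-common__edIee'})
        type_texts = [element.text for element in company_type]
    else:
        type_texts = ''
    print(type_texts)   # '' for the first item, ['小微企业'] for the second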


5. Improve the code:

(Note: The cookie needs to be filled in by yourself)

import time
import requests
from bs4 import BeautifulSoup
import csv

def get_TYC_info(page):
    TYC_url = f"https://www.tianyancha.com/search?key=&sessionNo=1688538554.71584711&base=hub&cacheCode=00420100V2020&city=wuhan&pageNum={page}"
    html = get_page(TYC_url)
    soup = BeautifulSoup(html, 'lxml')
    GS_list = soup.find('div', attrs={'class': 'index_list-wrap___axcs'})
    GS_items = GS_list.find_all('div', attrs={'class': 'index_search-box__7YVh6'})
    for item in GS_items:
        title = item.find('div', attrs={'class': 'index_name__qEdWi'}).a.span.text
        link = item.a['href']
        company_type_div = item.find('div', attrs={'class': 'index_tag-list__wePh_'})
        if company_type_div is not None:
            company_type = company_type_div.find_all('div', attrs={'class': 'index_tag-common__edIee'})
            type_texts = [element.text for element in company_type]
        else:
            type_texts = ''
        money = item.find('div', attrs={'class': 'index_info-col__UVcZb index_narrow__QeZfV'}).span.text

        print(title.strip(), link, type_texts, money)




def get_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36',
            'Cookie': '!!!!!!!!!!'
        }
        response = requests.get(url, headers=headers, timeout=10)
        return response.text
    except:
        return ""


if __name__ == '__main__':
    with open('1.csv', 'a', encoding='utf-8', newline='') as f:
        csv_w = csv.writer(f)
        csv_w.writerow(('公司名', 'URL', '类型', '资金'))
        for page in range(1, 6):
            get_TYC_info(page)
            print(f'第{page}页已爬完')
            time.sleep(2)
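One thing worth noticing: both versions open 1.csv and write the header, but the loop only print()s each record, so no data rows ever reach the CSV. If you also want the rows saved, a small variation (my own, not part of the article's code; the helper name save_record and the sample values are placeholders) is to pass the writer into the crawl function and call writerow where the print is:

import csv

def save_record(csv_w, title, link, type_texts, money):
    # in the real crawler this call would replace (or accompany) the print()
    csv_w.writerow((title.strip(), link, type_texts, money))

if __name__ == '__main__':
    with open('1.csv', 'a', encoding='utf-8', newline='') as f:
        csv_w = csv.writer(f)
        csv_w.writerow(('公司名', 'URL', '类型', '资金'))
        # placeholder record; get_TYC_info(page) would call save_record(...) per item
        save_record(csv_w, '示例公司', 'https://example.com', ['小微企业'], '100万')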



