Hands-on | How to write a Python web crawler (with detailed source code)

What is a crawler?

Practice builds on theory. Before writing a crawler, you should first understand the relevant rules and principles; the Internet is not a lawless place, and careless crawling can land you in trouble one day. First, let's look at the definition of a crawler: a web crawler (also known as a web spider or web robot, and in the FOAF community often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. In a word, it is an online information porter.

Next, let's look at the rules a crawler should follow: the robots protocol (robots.txt) is an ASCII-encoded text file stored in the root directory of a website. It tells web search engine robots (also called web spiders) which parts of the site should not be fetched and which parts may be. In one sentence, it tells you what can be crawled and what cannot.
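
As a side note, Python's standard library can check a site's robots.txt before you crawl it. Below is a minimal sketch; the example.com URLs are placeholders for illustration, not the site crawled in this article.

from urllib import robotparser   # standard-library robots.txt parser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')   # placeholder URL, swap in the real site
rp.read()                                     # download and parse the robots.txt
print(rp.can_fetch('*', 'http://example.com/some/page'))  # True if crawling this path is allowed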

With the definition and the rules out of the way, the last piece of groundwork is the basic principle of a crawler. It is very simple; as a self-styled soul painter, I have drawn a schematic diagram to help you understand.
[Figure: hand-drawn schematic of the basic crawler workflow]
(⊙o⊙)... Embarrassing. Why does my mouse handwriting look so ugly? I'm almost ashamed to admit I once studied calligraphy.

Background of the project

That's enough theory; some readers may already think I'm being long-winded, so let's skip the chatter and go straight to the practical part. This small crawler project came from a friend's request: crawl mahogany price data from the China Timber Price Index Network, to make it easier to write a mahogany research report. The website looks like this:
[Figure: screenshot of the China Timber Price Index Network price listing page]
The required fields are marked with red boxes. A rough look at the data volume shows more than 50,000 records across 1,751 pages; copying and pasting by hand would take until who knows when, while Python only needs to run for a few minutes to save all the data to Excel. Isn't that comfortable?

Project practice

Tool: PyCharm

Python version: Python 3.7

Browser: Chrome (recommended)

Friends writing a crawler for the first time may find it troublesome. Don't panic; let's go from shallow to deep and start by crawling a single page of data.

1. Crawl a page

First, briefly analyze the structure of the webpage: right-click and choose Inspect, switch to the Network tab, refresh the page, and click the first entry in the Name list. We can see that the request method of this website is GET, and the request headers carry information such as the user's operating system and browser version.
[Figure: Chrome DevTools Network panel showing the GET request and its headers]
Next, install (with pip) and import all the libraries the crawler needs. The purpose of each library is noted in the comments.

import csv   # store the crawled data as csv, which Excel can open directly
import time  # add delays between requests; crawling too fast makes anti-crawling measures more likely
from time import sleep  # same as above
import random  # randomize the delays to better imitate human behavior
import requests  # send requests to the website
from lxml import etree  # lxml is a fast, powerful third-party HTML parsing library

Construct the request URL and add the header information (headers), i.e. copy the User-Agent highlighted above, then send a request to the server with requests.get and get back the HTML text. The point of adding headers is to tell the server that a real person is visiting its site. If you access the server without headers, the server will see that Python is making the request and is likely to anti-crawl you; the most common counter-measure is to block your IP.

url = 'http://yz.yuzhuprice.com:8003/findPriceByName.jspx?page.curPage=1&priceName=%E7%BA%A2%E6%9C%A8%E7%B1%BB'
headers = {
    'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
}
response = requests.get(url, headers=headers, timeout=10)
html = response.text  
print(html)

Let's run the above code and see the effect:
[Figure: the HTML source printed by the script]
Seeing this, friends new to crawlers may be a little confused.
In fact, this is the source code of the webpage. Right-click the page and choose View Page Source to compare; it looks like this:
[Figure: the browser's view-source page for the price listing]
The data we need to extract is buried in this page source, and we use etree from the lxml library to parse it.

parse = etree.HTML(html)  # parse the web page

After parsing, we can happily extract the data we need. There are many ways to do this: xpath, CSS select, BeautifulSoup, and the trickiest of all, re (regular expressions). The data structure crawled in this article is fairly simple, so let's just go with xpath.
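
For comparison, here is a minimal sketch of what the same kind of extraction could look like with BeautifulSoup, one of the alternatives just mentioned. It assumes the rows are plain tr/td table cells, as the screenshots below suggest; the rest of this article sticks with xpath.

from bs4 import BeautifulSoup   # third-party parser, installed with: pip install beautifulsoup4

soup = BeautifulSoup(html, 'lxml')            # reuse the html text fetched above
for row in soup.select('tr[id="173200"]'):    # the same rows the xpath below targets
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells)                              # one list of field values per row
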
[Figure: the page source, where each data row is a tr element]
We can see that in the page source each row of data corresponds to a tr element with id="173200", so let's first extract all of these tr elements.

all_tr = parse.xpath('//*[@id="173200"]')

Some friends may not be comfortable writing xpath by hand. In that case there is a simpler way: right-click the element in Chrome DevTools and copy the required xpath directly.
[Figure: using Copy XPath in Chrome DevTools]
All the tr elements have been extracted; next, the specific fields have to be pulled out of each tr in turn. For example, to extract the product name field, expand the first tr, select the name cell, and copy its xpath. The same approach applies to the other fields.
[Figure: copying the xpath of the product name cell in DevTools]
A few points to note: tr = {key1: value1, key2: value2} is a Python dictionary (you could also store the fields in a list or tuple if you prefer). ''.join(...) converts the list returned by xpath into a string; the leading ./ makes the path relative to the //*[@id="173200"] node matched above; and strip() does simple cleanup of the extracted text.

for tr in all_tr:
    tr = {
        'name': ''.join(tr.xpath('./td[1]/text()')).strip(),
        'price': ''.join(tr.xpath('./td[2]/text()')).strip(),
        'unit': ''.join(tr.xpath('./td[3]/text()')).strip(),
        'supermaket': ''.join(tr.xpath('./td[4]/text()')).strip(),
        'time': ''.join(tr.xpath('./td[5]/text()')).strip()
    }

Let's add print(tr) inside the loop and see the effect.
[Figure: the printed dictionaries, one per data row]

At this point you may be feeling pretty pleased with yourself.
But we're not done yet: now that we have the data, we need to save it locally in CSV format. This step is straightforward, so here is the code.

with open('wood.csv', 'a', encoding='utf_8_sig', newline='') as fp:
    # 'a' opens the file in append mode
    # the utf_8_sig encoding keeps the exported csv from showing garbled text in Excel
    fieldnames = ['name', 'price', 'unit', 'supermaket', 'time']
    writer = csv.DictWriter(fp, fieldnames)
    writer.writerow(tr)  # writes a single row; run this inside the for-tr loop above so every row is saved
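
One thing to watch out for: writer.writerow(tr) saves a single row, so it has to run once per tr. For reference, here is a minimal consolidated sketch of the extraction and saving steps together, with a header row written once; this is my own arrangement of the fragments above, not the author's exact code.

fieldnames = ['name', 'price', 'unit', 'supermaket', 'time']
with open('wood.csv', 'a', encoding='utf_8_sig', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames)
    writer.writeheader()                      # write the column names once
    for tr in all_tr:                         # one dictionary per data row, as above
        row = {
            'name': ''.join(tr.xpath('./td[1]/text()')).strip(),
            'price': ''.join(tr.xpath('./td[2]/text()')).strip(),
            'unit': ''.join(tr.xpath('./td[3]/text()')).strip(),
            'supermaket': ''.join(tr.xpath('./td[4]/text()')).strip(),
            'time': ''.join(tr.xpath('./td[5]/text()')).strip(),
        }
        writer.writerow(row)                  # save each row to wood.csv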

Open the newly generated wood.csv, it looks like this:
[Figure: wood.csv opened in Excel]

2. Crawl multiple pages

Don't celebrate too early: so far you have only crawled one page of data, and someone copying and pasting could beat that. Our ambition lies elsewhere, in poetry and faraway places... no wait, it lies in crawling massive amounts of data in seconds.

So, how can we crawl multiple pages of data? That's right, the for loop.

Let's go back and analyze the URL:

http://yz.yuzhuprice.com:8003/findPriceByName.jspx?page.curPage=1&priceName=%E7%BA%A2%E6%9C%A8%E7%B1%BB

Let's try to change the page.curPage inside to 2, as follows:
[Figure: the second page of results after changing page.curPage to 2]
You may have spotted the trick: changing page.curPage is all it takes to turn the page. So we just wrap the URL in a loop. format(x) is the string formatting method; it accepts any number of arguments and fills them into the {} placeholders.

for x in range(1,3):
    url = 'http://yz.yuzhuprice.com:8003/findPriceByName.jspx?page.curPage={}&priceName=%E7%BA%A2%E6%9C%A8%E7%B1%BB'.format(x)
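
Note that the time, sleep, and random imports from earlier haven't been used yet. A well-behaved crawler should pause between page requests; here is a minimal sketch, assuming a random delay of one to three seconds is acceptable for this site.

for x in range(1, 3):
    url = 'http://yz.yuzhuprice.com:8003/findPriceByName.jspx?page.curPage={}&priceName=%E7%BA%A2%E6%9C%A8%E7%B1%BB'.format(x)
    response = requests.get(url, headers=headers, timeout=10)
    # ... parse response.text and save the rows, exactly as in the single-page example ...
    sleep(random.uniform(1, 3))   # random pause so the requests don't hammer the server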

From here on, just change the range and you can crawl as many pages as you like. Happy? Surprised?

3. Improve the crawler

If you run only the code above, the program is very likely to crash after a dozen or so pages. I have run into errors halfway through many times, and each one killed the crawler. After all the effort of writing it, how can we just let it collapse?
Errors can happen for many reasons. Anyone who has written crawlers knows that debugging them is tedious and takes constant trial and error. The main error this crawler hits is TimeoutError, so we need to improve the code further.
[Figure: the TimeoutError traceback]
First, the code above should be wrapped into functions, because leaving it as a flat script has the following drawbacks:

1. Increasing complexity

2. The organizational structure is not clear enough

3. Poor readability

4. Code redundancy

5. Poor scalability

Second, add exception handling (try...except) wherever errors may occur, as sketched below.
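
As an illustration, here is a minimal sketch of what that might look like: the request is wrapped in a small function that retries a few times on failure. This is my own sketch of the idea, not the author's complete code, which only appears as a screenshot below.

def get_html(url, max_retries=3):
    """Request a page and return its HTML text, retrying on timeouts or other request errors."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()                   # raise if the server returned an error status
            return response.text
        except requests.exceptions.RequestException as err:
            print('request failed ({}), retry {}/{}'.format(err, attempt + 1, max_retries))
            sleep(random.uniform(2, 5))                   # back off a little before retrying
    return None                                           # give up after max_retries attempts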

After finishing, part of the result is shown in the figure below. Due to space limitations I won't post all of the code here; friends who need the complete code can get it for free via the CSDN-certified QR code below on WeChat [guaranteed 100% free].

[Figure: excerpt of the completed crawler code]

About Python Technical Reserve

Learning Python is worthwhile whether it's for a job or for a money-making side business, but you still need a study plan. Finally, I'm sharing a full set of Python learning materials to help anyone who wants to learn!

Click here to get it for free: CSDN bundle: "Python learning route & full set of learning materials", shared for free.

Python Study Outline

The technical points across every area of Python have been organized into a summary of knowledge points by field. Its value is that you can find learning resources for each knowledge point, so your study is more comprehensive.
[Figure: Python study outline]

Getting Started Learning Video

Python practical case

Theory alone is useless; you have to follow along and do things yourself so you can put what you've learned into practice. At that point, you can learn from some hands-on case studies.
The full version of this set of Python learning materials has been uploaded to CSDN; if you need it, you can private message me to get it for free [100% free guarantee].
