Introduction to Python crawler basics: you will get started after reading this article

Preface

The text and pictures in this article are from the Internet and are for learning and communication purposes only; they are not for any commercial use. If you have any questions, please contact us.



Hello everyone, today we are going to talk about the basic operations of a Python crawler. Anyway, this is how I got started, haha.

In fact, when I first learned Python, it was for data processing and analysis with pandas and the like. Later, I found that crawlers are great fun and solve the tedious problem of collecting online data by hand. For example, what I use them for most is crawling game reviews on TapTap, bullet comments from a certain video site and a certain drama, shop information from a certain review site, host information from Huya, and so on.

As for crawlers, I only know some basic operations, but from personal experience these basics can meet most routine needs. As for advanced crawler skills, everyone will naturally develop their own ideas and ways of learning further once they understand the basics of crawling.

Next, let's get into the topic~

0. The basic crawler process

Breaking the crawler process into modules, it can basically be summarized in the following steps (a minimal end-to-end sketch follows the list):

  • [√] Analyze the web page URL: open the website you want to crawl and find the URL that actually serves the page data;
  • [√] Request the web page data: simulate a request for the page data; here we introduce the requests library;
  • [√] Parse the web page data: depending on what the request returns, parse it with different methods into the data we need (if the page data is html source code, we parse it with Beautiful Soup, XPath, or re regular expressions; if the page data is in json format, we can process it directly with basic dictionary and list operations);
  • [√] Store the web page data: generally speaking, the parsed data is fairly structured and can be saved as text such as txt, csv, or json, or as an excel file, or written to a database such as MySQL, MongoDB, or SQLite.
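
To make these four steps concrete, here is a minimal end-to-end sketch. The URL, the title selector, and the output file name are placeholders for illustration, and it assumes requests, beautifulsoup4, and lxml are installed:

import csv

import requests
from bs4 import BeautifulSoup

# Step 1: the real data URL found by analyzing the page (placeholder URL)
url = 'https://example.com/list'

# Step 2: request the page data
r = requests.get(url)

# Step 3: parse the html into the data we need (placeholder selector)
soup = BeautifulSoup(r.text, 'lxml')
titles = [tag.text for tag in soup.find_all('div', class_='title')]

# Step 4: store the parsed data locally as csv
with open('result.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows([t] for t in titles)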

1. Analyze the web page URL

When we have a target website, we sometimes find that for static web pages, we only need to pass the URL in the address bar to a get request to obtain the page data directly. But if it is a dynamic web page, we cannot get the page data simply by passing the address-bar URL to a get request; often, in this case, when we turn the page the URL in the address bar does not change at all.

Next, we will separately introduce how to obtain the real page data URL address in these two cases.

1.1 Static webpage

For static web pages, the URL in the address bar is exactly what we need.

Taking the Beike second-hand housing site (https://bj.ke.com/ershoufang/) as an example, when we turn to page 2 the address-bar URL becomes https://bj.ke.com/ershoufang/pg2/. Cases like this are mostly static web pages, and the URL pattern for turning pages is very obvious.
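
Because only the pgN segment of the path changes, the page URLs can be generated directly; a small sketch (the page range 1 to 10 is an arbitrary choice for illustration):

# the only thing that changes between pages is the pgN segment in the path
base_url = 'https://bj.ke.com/ershoufang/'
urls = [f'{base_url}pg{page}/' for page in range(1, 11)]
print(urls[1])  # https://bj.ke.com/ershoufang/pg2/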

1.2 Dynamic web pages

For dynamic web pages, we can generally find the real URL address through the following steps:

1. Press "F12" to enter the browser's developer mode;
2. Click "Network" -> XHR or JS (or simply look through all of them);
3. Turn the page (this may mean clicking the next page, or scrolling down to load more);
4. Observe how the entries under Name in step 2 change, and look for the request that carries the data.

Taking the Huya Xingxiu section (https://www.huya.com/g/xingxiu) as an example, we can see that the address-bar URL does not change when turning pages (for example, to page 2).

To find the real URL address more easily, we can use the panel shown in the screenshot below in developer mode. Preview shows a preview of the response content, which makes it easy to match and locate the specific Name.

[Screenshot: the Network panel in developer mode, with Preview used to locate the specific Name]

When we locate the specific Name, select Headers on the right to view the parameter information required by the request, and ideally work out the pattern by which the parameters change. Taking Huya Xingxiu as an example, its real URL address and the way it changes are as follows:

[Screenshot: the real URL address and how it changes between pages]

2. Request web page data

After we have determined the URL that really serves the data, we can use the get or post method of requests to request the web page data.

For more ways to use the requests library, see its documentation (https://requests.readthedocs.io/zh_CN/latest/).

2.1 Send a get request

In [1]: import requests

In [2]: url = 'https://bj.ke.com/ershoufang/'

In [3]: r = requests.get(url)

In [4]: type(r)
Out[4]: requests.models.Response

In [5]: r.status_code
Out[5]: 200

What we get back is a Response object. If we want the page data, we can read it from the text or content attribute. In addition, if the page data is in json format, we can use the built-in json() decoder method of requests to handle it.

  • r.text: string data; this attribute is generally used when the page data is plain text
  • r.content: binary data; this attribute is generally used when the page data is a video or an image
  • r.json(): json decoding; this method is generally used when the page data is in json format

For some dynamic web pages, the requested URL is a combination of a base url and keyword parameters. In this case, we can use the params keyword argument to provide these parameters as a dictionary of strings.

In [6]: url = 'https://www.huya.com/cache.php'
   ...: parames = {
   ...:     'm': 'LiveList',
   ...:     'do': 'getLiveListByPage',
   ...:     'gameId': 1663,
   ...:     'tagAll': 0,
   ...:     'page': 2, # this is the parameter that changes when turning pages
   ...:     }
   ...: 
   ...: r = requests.get(url, params=parames)

In [7]: r.url
Out[7]: 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&page=2'

2.2 Send a post request

Usually, you want to send some data encoded as a form, much like an HTML form. To do this, simply pass a dictionary to the data parameter; your data dictionary will be automatically encoded as a form when the request is made:

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.post("http://httpbin.org/post", data=payload)

Many times, the data you want to send is not form-encoded. If you pass a string instead of a dict, the data will be posted directly:

>>> import json

>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}

>>> r = requests.post(url, data=json.dumps(payload))

Instead of encoding the dict yourself, you can also pass it with the json parameter, and it will be encoded automatically:

>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}

>>> r = requests.post(url, json=payload)

2.3 Custom request headers

When simulating a request, if no request headers are set, it is easy for the website to detect that the request comes from a crawler script, and some websites will reject such simulated requests. We can therefore set the request headers as a simple disguise, usually setting the User-Agent to that of a browser:

headers = {
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
     }
r = requests.get(url, headers=headers)

In fact, many more parameters can be set in the request headers. In practice, you can study the request headers section in developer mode and handle them accordingly during actual crawling.

[Screenshot: Huya Xingxiu request headers]
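
For example, besides User-Agent, fields such as Referer often appear in the Headers panel. A hedged sketch (the header values below are placeholders for illustration; copy the real ones from developer mode):

import requests

# the values below are placeholders; copy the real ones from the Headers
# panel in developer mode for the site you are crawling
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'Referer': 'https://www.huya.com/g/xingxiu',
}
r = requests.get('https://www.huya.com/g/xingxiu', headers=headers)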

2.4 Response codes

We saw in 2.1 that the response code is obtained through the r.status_code attribute. Generally speaking, a return value of 200 means the page data was obtained successfully.

Response codes fall into five classes, identified by their first digit:

  • 1xx: informational - the request was received, continue processing
  • 2xx: success - the request was successfully received, understood, and accepted
  • 3xx: redirection - further action must be taken to complete the request
  • 4xx: client error - the request contains a syntax error or cannot be fulfilled
  • 5xx: server error - the server failed to fulfill an apparently valid request
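
In code, you can branch on r.status_code directly, or let requests raise an exception on error codes; a small sketch:

import requests

r = requests.get('https://bj.ke.com/ershoufang/')
if r.status_code == 200:
    print('page fetched successfully')

# alternatively, raise an HTTPError automatically on 4xx/5xx responses
r.raise_for_status()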

3. Parse the data

As mentioned above, the web page data we request is either html source text or a json string. The two are parsed differently; below we give a brief description of each, and which to use depends on the actual situation.

3.1 Parsing web page html text

For web page html text, we introduce three parsing methods here: Beautiful Soup, XPath, and re regular expressions.

Take the newest listings on the Beike second-hand housing site (https://bj.ke.com/ershoufang/co32/) as an example; the relevant html source is shown below, and we parse the data returned by the get request.

[Screenshot: html source of the listing page]

3.1.1 Beautiful Soup

For more ways to use the Beautiful Soup library, see its documentation (https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/).

First install it: pip install beautifulsoup4.

We pass the web page html text r.text as the first parameter to BeautifulSoup; the second parameter is the parser type (lxml is used here). This initializes the BeautifulSoup object, which we then assign to the soup variable.

from bs4 import BeautifulSoup
import requests

url = 'https://bj.ke.com/ershoufang/co32/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

 

The code to get the name of a listing is as follows:

# locate the node that contains all the listings
sellList = soup.find(class_="sellListContent")
# get the list of listing nodes
lis = sellList.find_all('li', class_="clear")
# select the first listing node
div = lis[0].find('div', class_="info clear")
# collect the listing's name
title = div.find('div', class_="title")
print(title.text)
明春西园 2室1厅 南 北
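
Building on this, a short sketch that collects the names of all listings on the page; it reuses the lis node list from above and assumes every listing has the same structure:

# collect the names of all listings on the page, reusing lis from above
titles = []
for li in lis:
    info = li.find('div', class_="info clear")
    title = info.find('div', class_="title")
    titles.append(title.text.strip())
print(titles[:3])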

3.1.2 xpath

XPath, whose full name is XML Path Language, is a language for finding information in XML documents.

First install lxml: pip install lxml.

The common rules are as follows:

  • nodename: selects all child nodes of the named node
  • /: selects a direct child node from the current node
  • //: selects descendant nodes from the current node
  • .: selects the current node
  • ..: selects the parent of the current node
  • @: selects attributes

First import the etree module from the lxml library, then pass the html text to etree.HTML to initialize it; this constructs an XPath parsing object.

from lxml import etree
import requests

url = 'https://bj.ke.com/ershoufang/co32/'
r = requests.get(url)
html = etree.HTML(r.text)

 

The XPath obtained via Copy in developer mode: //*[@id="beike"]/div[1]/div[4]/div[1]/div[4]/ul/li[1]/div/div[1]/a

# locate the ul node that contains all the listings, matched precisely by attribute
ul = html.xpath('.//ul[@class="sellListContent"]')[0]
# get the list of listing nodes
lis = ul.xpath('.//li[@class="clear"]')
# select the first listing node
li = lis[0]
# get its listing name
li.xpath('./div/div[1]/a/text()')
['明春西园 2室1厅 南 北']
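
The same idea extends to all listings in one expression; a sketch under the assumption that the class attributes match exactly as above:

# collect every listing name with a single XPath query; this assumes the
# title div's class attribute is exactly "title"
titles = html.xpath('//ul[@class="sellListContent"]/li[@class="clear"]//div[@class="title"]/a/text()')
print(titles[:3])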

3.1.3 re regular

As for parsing web page html with re regular expressions, you can also refer to the previously published article "Learning Python regular expressions re for crawler webpage HTML".

import re

# find the characters before and after the listing name, then build a regular expression from them
re.findall(r'<a class="VIEWDATA CLICKDATA maidian-detail" title="(.*?)"', r.text, re.S)[0]
'明春西园 2室1厅 南 北'

3.2 Parsing json text

requests provides r.json(), which can be used to decode json data; this method is generally used when the page data is in json format. Alternatively, the data can be processed with json.loads() (eval() is sometimes used too, but it is unsafe and not recommended).
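
As a quick illustration, json.loads() turns a json string into the corresponding Python object:

import json

s = '{"name": "huya", "page": 2}'  # a json string for illustration
data = json.loads(s)               # decode the json string into a dict
print(data['page'])                # 2

Returning to the Huya example from section 2.1, r.json() decodes the response body directly: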

 

url = 'https://www.huya.com/cache.php'
parames = {
     'm': 'LiveList',
     'do': 'getLiveListByPage',
     'gameId': 1663,
     'tagAll': 0,
     'page': 2, # this is the parameter that changes when turning pages
     }

r = requests.get(url, params=parames)
data = r.json()
type(data)
dict

The data obtained after parsing this way is a dictionary; we then look for the fields we need in the dictionary and simply take them out.
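
For example, supposing the host list sits under nested fields of the response (the data, datas, and nick keys here are my assumptions for illustration, not confirmed field names; check the Preview panel in developer mode for the real structure):

# field names below are assumptions for illustration; check the Preview
# panel in developer mode for the real structure of the response
rooms = data.get('data', {}).get('datas', [])
for room in rooms[:3]:
    print(room.get('nick'))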

4. Store data

Once we have the data we want, we can write it to a local file.

For text data, you can use the csv module or the pandas module to write a local csv or excel file; you can also use the pymysql module to write to a MySQL database, or the sqlite3 module to write to a local SQLite database.
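
A small sketch of writing parsed rows to a local csv file with the standard csv module (the titles list and file name are placeholders for whatever you parsed in section 3):

import csv

# titles stands in for whatever structured data you parsed in section 3
titles = ['明春西园 2室1厅 南 北']

with open('ershoufang.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])             # header row
    writer.writerows([t] for t in titles)  # one row per listing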

For videos or pictures, you can open a file in binary mode, write the binary content (r.content), and save it locally.
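
For example, a sketch that saves an image via r.content (the image URL is a placeholder):

import requests

img_url = 'https://example.com/some_image.jpg'  # placeholder image URL
r = requests.get(img_url)

# write the binary response body (r.content) to a local file
with open('some_image.jpg', 'wb') as f:
    f.write(r.content)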

You can learn more about storing data by working through actual cases.
