How many pitfalls lie on the road to learning web crawlers

Foreword

Crawlers and data collection have become familiar topics, and more and more beginners are adding them to their learning plans. But are these beginners really prepared for scraping and collection? How many stumbles and pits do you have to step through before the collection process is really straightened out?

Although Lao Gu only started learning Python fairly recently, he has been doing data collection for quite a while. Let me use some practical cases to talk about the pitfalls that are easiest to step into along the way.

The collection succeeds, but there is no data?

Many readers run into this situation; recently there have been plenty of examples of it in the Q&A section.

So was the page actually collected successfully? Where did the data go? Let's break down the possible situations.

The data is in the collected page and in the expected format

This is the ideal case: the returned page source is exactly what we see in the browser.

When scraping, the first thing most people like to do is open the developer console and inspect the elements.

For example, let's collect this page: https://www.ccgp.gov.cn/cggg/zygg/jzxcs/202303/t20230312_19542353.htm

By inspecting the elements we can easily find where the text we want to grab lives: inside the div whose class is vF_detail_content_container. Let's try to fetch the page and extract that text with Python.

from bs4 import BeautifulSoup
import requests
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63"
}
response = requests.get("https://www.ccgp.gov.cn/cggg/zygg/jzxcs/202303/t20230312_19542353.htm", headers=headers)
html = response.content.decode('utf8')  # decode the raw bytes ourselves
soup = BeautifulSoup(html, "html.parser")
# grab the div that holds the announcement body
contents = soup.findAll('div', attrs={'class': 'vF_detail_content_container'})
for content in contents:
    print(content)

This is the ideal kind of collection: a few lines of code and you have the data.
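If you want the plain text rather than the whole HTML of the div, BeautifulSoup's get_text() can strip the tags. A minimal sketch building on the contents variable from the code above:

# a small sketch: pull just the readable text out of the matched div(s)
for content in contents:
    text = content.get_text(separator='\n', strip=True)  # drop tags, keep line breaks
    print(text)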

The data is in the collected page, but extraction comes up empty

Take a question from the Q&A section as an example, https://ask.csdn.net/questions/7897149/54099048?spm=1001.2014.3001.5501. The collection target this time is https://pic.sogou.com/pics?query=橘子皮 (orange peel).

Again, by inspecting the elements we find the nodes for all the pictures: the li elements under the element styled figure-result-list. The person asking the question did the same thing, so let's reproduce it.

import requests
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
url = 'https://pic.sogou.com/pics?query=橘子皮'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# select the picture nodes the way they appear in the element inspector
li_list = tree.xpath('//div[@class="figure-result"]/ul/li')
print(li_list)

Huh? The page clearly shows the content, so why does the script print an empty list?

Hmm... here we need a bit of background knowledge, namely: where does the content displayed by the browser actually come from?

You might ask: isn't it from the page?

That's not wrong, but it's incomplete. Let Lao Gu explain first.

Remedial lesson: what information is on a page

The HTML part

First, the browser loads the HTML page. Whatever the URL's suffix is and whatever language the site was developed in, by the time it reaches the browser it is HTML. This is the first piece. When we collected the ccgp page just now, we only extracted content from this HTML.

The CSS part

After loading the HTML, the browser starts rendering the page to give it a friendly visual appearance. It parses the HTML, finds all the style definitions, and loads the CSS files referenced through link tags to style the page content.

The script part

After rendering, the browser moves on to interpreting and running scripts, which is what we usually call JS (or VBS, though few people use that now). At this point interactive content, or content supplemented later, is also added to the rendered page through script operations.

Other parts loaded or requested by scripts

Take the orange-peel page again: on the Network panel we can see there are many types of requests. Besides the usual images and JS there is also XHR, and that is the part requested by scripts.

That's the end of the remedial lesson; the above is the complete set of data that makes up a page. Now back to extracting data from the orange-peel page: where is its data?

Right-click on the page and a context menu pops up. Near the bottom is the familiar "Inspect" item (the element viewer), which makes it easy to look at the DOM; but what matters more here is viewing the page source.

This time, let's check the page source to see whether what we need is actually in the orange-peel page.
From inspecting the elements we can see that the title of the first picture is "Women put orange peel on their navel before going to bed; stick with it for a week and several big surprise changes arrive uninvited". Searching for it, we do find this text in the source file.
So the data is in the collected page, just not in the position we expected: the script (JS) renders it into the place we see in the browser. How do we get it, then? Looking at where it actually appears, we find it sitting inside a script tag, so let's extract the information from there.

import requests
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
url = 'https://pic.sogou.com/pics?query=橘子皮'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
js_list = tree.xpath('//script')
# print every script fragment so we can see which one holds the data
for i, js in enumerate(js_list):
    print(' == > ')
    print(i, etree.tostring(js_list[i]))

Traversing the scripts, we see that the second script fragment contains the data we need, but we have to parse it ourselves, and the text inside is all numeric character references (&#number;) that need to be converted back into real characters.

Let's cut off the head and the tail, keep only the data part, and take a look.


import requests
import re
import json
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
url = 'https://pic.sogou.com/pics?query=橘子皮'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
js_list = tree.xpath('//script')
# we already know from the loop above that the second item of js_list holds the data
def unicode(n):  # regex callback: turn a numeric character reference into the character
    return chr(int(n.group(1)))
# cut off the leading assignment and the trailing function call / closing tag
jscode = re.sub(r'^.*?=|;\(function.*|;?</script>', '', etree.tostring(js_list[1]).decode('utf8'))
# convert &#NNNN; references back into real characters
jscode = re.sub(r'&#(\d+);', unicode, jscode)
jsdata = json.loads(jscode)
print(jscode)

Very good. We loaded the data with json, and by inspecting the variables we find that jsdata['searchList']['searchList'][0] is the record for the first picture, and its title is exactly the one we found just now, the "women before going to bed..." one.
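To put that to use, here is a minimal sketch of walking that list. Apart from the nested searchList path mentioned above, the 'title' key is an assumption based on what the variable viewer showed, so adjust it to whatever the real key names are in the response:

items = jsdata['searchList']['searchList']  # the list of picture records
for item in items:
    # 'title' is an assumed key name; inspect one item to confirm the real field names
    print(item.get('title'))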

Well, that's this little pit dealt with.

The data is in a JS fragment on the page, but I don't know how to read it out correctly

We already hit this in the orange-peel example: we know the data is in the second JS fragment, but it is full of things like &#number;. The browser will correctly display the corresponding characters, but when we grab the raw text we have no idea what we are looking at. Is this meant for humans at all?

Let's look at another question from the Q&A section: https://ask.csdn.net/questions/7896060/54096521?spm=1001.2014.3001.5501

The asker wants to grab the text of a novel chapter and ran into the same situation. Let's grab the content first and then analyze it together.

import requests
url = 'https://yc.ifeng.com/book/3303804/10/'
headers = {
    'host': 'yc.ifeng.com',
    'referer': 'https://yc.ifeng.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
res.encoding = 'UTF-8'  # the asker set the encoding explicitly, which is good practice
print(res.text)

Ah. This time the text is in hexadecimal unicode escapes (%uXXXX), and again it sits inside JS. Don't worry; another remedial lesson first.

Remedial lesson: using regular expressions to convert unicode escapes and other encodings into normal characters

In Python, regular expressions live in the re module; import re and you can use them.

The most common uses of regex are validation (phone numbers, email addresses, and so on) and extracting the content we need according to a pattern.

But in real collection work there is also regex replacement, which is extremely common and extremely handy. I won't go into regex itself here; you can read my earlier article on regular expressions and the regex part of Python.

Here let's go straight to Python regex replacement using a callback function.

In the orange-peel example we already wrote such a custom function, unicode:

import re
def unicode(n):  # regex callback: numeric character reference -> character
    return chr(int(n.group(1)))
print(re.sub(r'&#(\d+);', unicode, '&#22899;&#24615;&#30561;&#21069;&#25226;&#27224;&#23376;&#30382;&#36148;&#22312;&#32922;&#33040;&#19978;,&#22362;&#25345;&#19968;&#21608;,&#20960;&#22823;&#24778;&#21916;&#21464;&#21270;&#19981;&#35831;&#33258;&#26469;'))

Here we use re.sub(pattern, replacement string or function, original string): by passing a function as the replacement, each match gets converted according to our own rules.

The novel from just now works the same way:

import re
def unicodeHex(n):  # regex callback: %uXXXX hexadecimal escape -> character
    return chr(int(n.group(1), 16))
print(re.sub(r'%u([0-9a-fA-F]{4})', unicodeHex, '%u3000%u3000%u53F6%u660A%u770B%u4E86%u4E00%u773C%u8FD9%u4E2A%u7F8E%u5973%uFF0C%u5012%u662F%u60F3%u8D77%u6765%u4E86%uFF0C%u8FD9%u662F%u590F%u4E91%uFF0C%u4EE5%u524D%u8FD8%u5728%u53F6%u6C0F%u5BB6%u65CF%u7684%u65F6%u5019%uFF0C%u5979%u8DDF%u8FC7%u81EA%u5DF1%uFF0C%u60F3%u4E0D%u5230%u5979%u73B0%u5728%u5C45%u7136%u662F%u53F6%u6C0F%u6295%u8D44%u516C%u53F8%u7684%u603B%u88C1%u79D8%u4E66%u3002%3Cbr%2F%3E%3Cbr%2F%3E%u3000%u3000%u201C%u597D%u4E45%u4E0D%u89C1%u3002%u201D%u53F6%u660A%u70B9%u4E86%u70B9%u5934%u3002%3Cbr%2F%3E%3Cbr%2F%3E%u3000%u3000%u201C%u590F%u79D8%u4E66%uFF0C%u4F60%u4E0D%u4F1A%u662F%u7CCA%u6D82%u4E86%u5427%uFF1F'))

These are just two examples. In real work you will meet all kinds of encodings that need decoding, such as urldecode, unescape and so on. Whenever the data you receive is not meant for human eyes, remember to transcode it; the standard library helpers sketched below cover the common cases.
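A minimal sketch of those common cases with Python's standard library (the sample strings below are made up purely for illustration):

from urllib.parse import unquote
import html
# percent-encoded text, i.e. what urlencode/urldecode deal with
print(unquote('%E6%A9%98%E5%AD%90%E7%9A%AE'))   # -> 橘子皮
# HTML numeric references such as &#22899; and named entities such as &amp;
print(html.unescape('&#22899;&#24615;&amp;'))   # -> 女性&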

Checking the source file: the content we need isn't there at all?

This situation is also extremely common. It is exactly the last item from the remedial lesson: parts loaded or requested by scripts.

Again using an example from the Q&A section, https://ask.csdn.net/questions/7900390/54106867?spm=1001.2014.3001.5501, this time the target is Bilibili.

from bs4 import BeautifulSoup
import requests
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63"
}
for page in range(1, 10, 1):
    # reproduce the asker's approach: request the index page itself
    response = requests.get(f"https://www.bilibili.com/anime/index/#season_version=-1&spoken_language_type=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page={page}", headers=headers)
    html = response.text
    print(html)
    soup = BeautifulSoup(html, "html.parser")
    all_bangumi_titles = soup.findAll("a", attrs={"class": "bangumi-title"})
    all_pub_infos = soup.findAll("p", attrs={"class": "pub-info"})
for bangumi_title in all_bangumi_titles:
    bangumi_title_string = bangumi_title.string
    print(bangumi_title)

Let's set aside the syntax errors in the asker's original code; I've tidied it up above first.

The asker is quite confused: the request succeeds, so why is there nothing?

The output contains no show titles or information, and printing the returned response.text just gives a repeated chunk of markup that doesn't contain the target fields at all.

Well, back to inspecting the elements. Oh: the index page itself contains very little, it's just a shell, and almost all the content is loaded by other means. Let's set the Network panel's filter and see how many XHR requests there are.

On the Network panel, click the funnel and select XHR, and we find a lot of requests. These are asynchronous data requests; because they are asynchronous they get their own category, XHR, while synchronously loaded scripts fall under JS. Now we need to find out: where is the data?

An XHR response usually contains one of two things: either commands to be executed immediately (usually anti-scraping measures), or data in JSON format. The Preview tab makes it easy to look through the data, and once we find it we note down that XHR request's address: that is what we need to grab, not Bilibili's index page.

The collection succeeds, but the content is garbled

For various historical reasons, many websites are not encoded in utf-8 but in local ANSI-style encodings such as gbk or big5; and sometimes even a utf-8 page comes out garbled when decoded with the local encoding.

Let's go back to the first collection example in this article:

from bs4 import BeautifulSoup
import requests
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63"
}
response = requests.get("https://www.ccgp.gov.cn/cggg/zygg/jzxcs/202303/t20230312_19542353.htm", headers=headers)
html = response.content.decode('utf8')  # decode the raw bytes ourselves instead of relying on response.text
soup = BeautifulSoup(html, "html.parser")
contents = soup.findAll('div', attrs={'class': 'vF_detail_content_container'})
for content in contents:
    print(content)

Have you noticed that I use response.content rather than the usual response.text? That's because we never told requests what encoding the response uses. The reader who collected the novel actually did well here: he set the encoding explicitly (even though he still didn't get the novel text out in the end). Back on the ccgp page, since I didn't specify an encoding, response.text gives me a pile of garbled characters.

Let me explain the various ways a site's content can come out garbled.

Garbled text is almost always caused by the two sides disagreeing about the encoding. The fix is either to set the encoding explicitly, as the novel collector did, or to decode response.content yourself afterwards, as Lao Gu does; a small sketch of both options follows.
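A minimal sketch of both options, using the ccgp page from above (treat 'utf-8' as an example and use whatever charset the target page actually declares):

import requests
headers = {"User-agent": "Mozilla/5.0"}
response = requests.get("https://www.ccgp.gov.cn/cggg/zygg/jzxcs/202303/t20230312_19542353.htm", headers=headers)
# Option 1: tell requests the encoding first, then read response.text
response.encoding = 'utf-8'          # or 'gbk', 'big5', ... as the page declares
html = response.text
# Option 2: decode the raw bytes yourself
html = response.content.decode('utf-8')
# If you are unsure, let requests guess the charset from the content itself
response.encoding = response.apparent_encoding
html = response.text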

That part is easy to deal with. But many sites don't serve nicely readable data at all, at least not readable by humans; the novel site and the Sogou image site we just met both require us to decode the encoded content ourselves.

Some readers run into an even nastier problem. I can't reproduce it right now, but I can describe it: when calling response.content.decode you get an error like

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This happens because the returned data is compressed with gzip or deflate, and we need to decompress it with the matching algorithm before decoding.

Check the content-encoding value in response.headers to decide which method to use for decompression and decoding:

# content-encoding : gzip
import gzip
html = gzip.decompress(response.content).decode('utf8')
# content-encoding : deflate
import zlib
try:
    html = zlib.decompress(response.content, -zlib.MAX_WBITS).decode('utf8')
except zlib.error:
    html = zlib.decompress(response.content).decode('utf8')

There is also another common compression algorithm, br (Brotli), but Lao Gu hasn't run into it yet, so he hasn't had to find out which package decompresses and decodes it.
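If you do run into it, the usual answer (an assumption on my part, not something verified in this article) is the third-party brotli package:

# content-encoding : br
# assumption: pip install brotli (or brotlicffi); not verified against a live br response here
import brotli
html = brotli.decompress(response.content).decode('utf8')  # response is the requests response from above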

How do we extract what we need from the crawled content?

Using lxml or bs4 to process HTML

Usually what we grab is HTML, which everyone is familiar with by now. Whether you use BeautifulSoup's html.parser or lxml's XPath, you can parse the HTML content.

import requests
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63",
    'accept-encoding': 'gzip, deflate, br'
}
response = requests.get("https://www.ccgp.gov.cn/cggg/zygg/jzxcs/202303/t20230312_19542353.htm", headers=headers)
html = response.content.decode('utf8')
正文 = []  # list that will hold the extracted body text
# Using BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
contents = soup.findAll('div', attrs={'class': 'vF_detail_content_container'})
for content in contents:
    正文.append(str(content))
# Using lxml
from lxml import etree
tree = etree.HTML(html)
contents = tree.xpath('//div[@class="vF_detail_content_container"]')
for content in contents:
    正文.append(etree.tostring(content, encoding='utf8').decode('utf8'))

As you can see, both approaches capture and extract the information correctly; use whichever you are more familiar with. Lao Gu isn't especially attached to either... and doesn't know whether there is an even simpler way that avoids the encoding hassle.

Using json to process JSON data

Sometimes, as mentioned earlier in this article, what you grab is JSON data. Don't overthink it; just use json.loads directly. Let's use the Bilibili request from before:

url = 'https://api.bilibili.com/pgc/season/index/result?season_version=-1&spoken_language_type=-1&area=-1&is_finish=-1%C2%A9right&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1&copyright=-1&season_type=1&pagesize=20&type=1'
import requests
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.63"
}
response = requests.get(url, headers=headers)
data = response.content.decode('utf8')
# load the data directly with json
import json
json_data = json.loads(data)
for k in json_data:
    print(k, json_data[k])

After loading, we just need to find the path down to the data we want:

lst = json_data['data']['list']
for k in lst:
    print(k['title'],k['index_show'],k['link'])


Using execjs to get data out of JS

Before using execjs, install the pyexecjs package; note that the name you install is different from the name you import. Since Lao Gu hasn't solved the "window is not defined" and "document is not defined" problems, he resorts to a small workaround. Take the novel-collecting example again:

import requests
url = 'https://yc.ifeng.com/book/3303804/10/'
headers = {
    'host': 'yc.ifeng.com',
    'referer': 'https://yc.ifeng.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
response = requests.get(url, headers=headers)
html = response.content.decode('utf8')
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
scripts = soup.findAll('script')
# the novel text turned out to be in the 11th script tag
# print(scripts[10].text)
# replace the jQuery call with a plain assignment and drop the trailing jQuery statements
js = scripts[10].text.replace('$("#partContent").html', 'var content = ').split('$')[0]
import execjs
ctx = execjs.compile(js)
result = ctx.eval('content')
print(result)

The nice thing about this approach is that the decoding happens inside the JS itself, so we don't have to decode anything ourselves.

Selenium for executing JS, which won't be covered here

I won't introduce it here. Lao Gu really doesn't like things that run quietly in the background and then jump out in front of you; back when he did collection in C#, he didn't like using the WebBrowser control either. Forgive Lao Gu for being stubborn; he just doesn't want to use it.

A rare case: a page whose source file is XML

In Lao Gu's collection career he once came across a truly odd website: the page looks like HTML, inspecting the elements shows HTML, but viewing the source file... it turns out to be XML with an XSLT stylesheet. (Of course, XML without XSLT has no way to turn into HTML.) Here is an example: https://www.govinfo.gov/content/pkg/BILLS-117hr3237ih/xml/BILLS-117hr3237ih.xml; interested readers can poke around. This particular example is actually fairly regular; there are more complicated ones where the XSLT performs calculations, but Lao Gu couldn't find one just now, so no example of that. I won't elaborate on how to handle it here either; you can refer to Lao Gu's article on learning XML with Python, and the short sketch below gives the general idea.
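For a rough idea, here is a minimal sketch assuming you have downloaded both the XML document and the XSLT stylesheet it references (the file names are placeholders):

from lxml import etree
# placeholders: the downloaded XML document and the XSLT stylesheet it references
xml_doc = etree.parse('BILLS-117hr3237ih.xml')
xslt_doc = etree.parse('stylesheet.xsl')
transform = etree.XSLT(xslt_doc)   # compile the stylesheet
html_result = transform(xml_doc)   # apply it, producing an HTML tree
print(etree.tostring(html_result, pretty_print=True).decode('utf8'))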

Other common errors, such as index out of range, won't be discussed in this article... that kind of mistake has nothing to do with the language; it comes purely from carelessness and impatience.


Original: blog.csdn.net/superwfei/article/details/129481133