Python3 crawler with urllib + data processing (using bs4) (2)

This time I will introduce the detailed usage of the urllib library and BeautifulSoup.
I will only cover how to use them and how to process the data. If you do not understand what some functions do, or want to know about exception handling, please refer to the previous article: https://blog.csdn.net/qq_36376711/article/details/86614578

urllib is a built-in library of Python 3; bs4 needs to be installed with pip install bs4 from cmd. If the install fails, it is almost always a problem with your environment variable settings or with pip.

request part:

request access method one:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url can be a string or a Request object, usually an HTTP/HTTPS address

data can usually be left as None; currently only HTTP/HTTPS requests use data (it is sent as the request body, for example in a POST)

timeout specifies the connection timeout in seconds; it only works for HTTP, HTTPS and FTP connections

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests.

cadefault can be ignored (the parameter is no longer used)

context is an instance of ssl.SSLContext describing the various SSL options
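
For example, a minimal sketch (my own, not code from the original post) that passes a timeout so the request gives up instead of hanging forever:

from urllib import request

# give up if the server does not respond within 10 seconds
resp = request.urlopen("https://docs.python.org/3.7/", timeout=10)
print(resp.getcode())  # 200 if the access succeeded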

from urllib import request

url = "https://docs.python.org/3.7/library/urllib.html"
# urlopen is the simplest access method in the request module
content = request.urlopen(url)
# print the URL you actually accessed
print(content.geturl())
# return the page's meta information (such as the headers) in the form of an
# email.message_from_string() instance
# see: https://docs.python.org/3.7/library/email.parser.html#email.message_from_string
print(content.info())
# getcode() returns 200 if the HTTP access succeeded
print(content.getcode())

html = content.read().decode("utf-8")

If HTTP or HTTPS access succeeds, an http.client.HTTPResponse object is returned; if it fails, an exception is raised instead (urllib.error.URLError, or its subclass HTTPError for HTTP error status codes).

urllib.request.urlopen() corresponds to urllib2.urlopen in the old Python 2 library.
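
If you just want a quick look at the failure case without opening the previous article, here is a rough sketch (my own, not the post's code) that catches the two exception classes urllib defines:

from urllib import request, error

try:
    resp = request.urlopen("https://docs.python.org/3.7/library/urllib.html")
    print(type(resp))  # <class 'http.client.HTTPResponse'> on success
except error.HTTPError as e:   # the server answered with an error status (404, 403, ...)
    print("HTTP error:", e.code)
except error.URLError as e:    # could not reach the server at all (bad host, timeout, ...)
    print("request failed:", e.reason)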

Request access method two:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
This class is an abstraction of a URL request

url2 = "https://www.csdn.net/"
# build header data so the request looks like it comes from a normal browser
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
content2 = request.Request(url=url2, headers=headers)
# a Request object has no read() method itself; pass it to urlopen() first
html2 = request.urlopen(content2).read().decode("utf-8")

Press F12 in the browser (or right-click and choose Inspect), open the Network tab, and perform some operations on the page. Take Google Translate as an example: type in some text to be translated, and you will see that data is exchanged. Click on any of the requests and look for the Request Headers. Which header fields are actually needed depends on the site you want to crawl, so inspect it yourself and analyse the specific situation; some unnecessary fields can be left out, which takes a bit of patient experimentation.

If the text alone is hard to follow, please see the illustrated explanation:
https://blog.csdn.net/qq_36376711/article/details/86679266
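
As a rough sketch of what such a request could look like in code (the URL, form fields and header set below are made up for illustration; they are not the real Google Translate parameters):

from urllib import request, parse

post_url = "https://example.com/translate"   # hypothetical endpoint
post_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36",
    "Referer": "https://example.com/",        # some sites check this header
}
# the form fields go into data as URL-encoded bytes, which makes this a POST request
post_data = parse.urlencode({"q": "hello", "target": "zh-CN"}).encode("utf-8")

req = request.Request(url=post_url, data=post_data, headers=post_headers)
# html3 = request.urlopen(req).read().decode("utf-8")  # uncomment to actually send it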

(The original post shows a screenshot of the Request Headers here.)

An example given in the urllib documentation:

with urllib.request.urlopen('http://www.python.org/') as f:
	print(f.read(300))

BeautifulSoup part:

bs4: a Python library for extracting data from HTML or XML files
Chinese documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Modules similar to bs4 are html.parser (which comes with Python) and lxml

from urllib import request
from bs4 import BeautifulSoup
url = "https://blog.csdn.net/qq_36376711/article/details/86675208"
html = request.urlopen(url).read().decode("utf-8")
soup = BeautifulSoup(html,'html.parser')

The 'html.parser' argument specifies which parser to use; see the parser comparison table in the bs4 documentation linked above.
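
A tiny sketch of choosing a parser (my own example; lxml is optional and must be installed separately):

from bs4 import BeautifulSoup

doc = "<html><body><h1>hello</h1></body></html>"
soup1 = BeautifulSoup(doc, "html.parser")   # the parser that ships with Python
# soup2 = BeautifulSoup(doc, "lxml")        # faster, but needs pip install lxml first
print(soup1.h1.string)                      # hello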

# You can also build a bs object directly from a local html file, or from a
# string (a fragment that follows html syntax)
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")

The calls above convert the document to Unicode, and HTML entities are converted to Unicode characters as well

# get all of the text content
print(soup.get_text())
# pretty-print the soup we obtained
print(soup.prettify())

# access the title through its tag, because the title text sits inside an h1 tag
# soup.h1 includes the tag, soup.h1.get_text() does not
print(soup.h1.get_text())
# a different way, but the result is the same as the previous line
print(soup.h1.string)

(Output shown in the original post: soup.h1 prints the whole <h1>...</h1> tag, while soup.h1.get_text() prints only the title text.)

find_all( name , attrs , recursive , text , **kwargs )

find( name , attrs , recursive , text , **kwargs )

name is the tag name (html, span, div, p, h1, etc.), attrs filters on tag attribute values such as the class, and recursive specifies whether to search recursively through all descendants; the default is True
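
A small made-up example of these three parameters, before applying them to the real page below:

from bs4 import BeautifulSoup

demo = "<div class='box'><p id='a'>one</p><span><p id='b'>two</p></span></div>"
demo_soup = BeautifulSoup(demo, "html.parser")

print(demo_soup.find_all("p"))                       # by tag name: both <p> tags
print(demo_soup.find_all(attrs={"class": "box"}))    # by attribute value: the <div>
print(demo_soup.div.find_all("p", recursive=False))  # direct children only: just id='a'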

# If you print soup.h1 you will notice that there is clearly more than one h1
# tag (if unsure, think of h1 as the largest font) besides the title, so you
# can see that soup.h1 stops searching at the first match. We therefore use
# find_all() to find every tag that matches.
# Both findAll() and find_all() work.
text_h1 = soup.find_all("h1")
for text in text_h1:
    print(text.get_text())
# search by attribute value such as class; the h1 above was found by tag name
# some tag attributes cannot be used in a search, e.g. the HTML5 data-* attributes
print(soup.find_all("title")[0].get_text())
# search by id; id=True would find every tag that has an id attribute
print(soup.find_all(id="article_content"))

#soup.find_all("a") is equivalent to soup("a")

# find() also stops at the first match it finds
print(soup.find("h1").get_text())

A brief description of the text parameter would be hard to follow, so go straight to the text parameter section of the Chinese bs4 documentation linked above.
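
For reference, a minimal sketch of what the text parameter does (my own example; newer bs4 versions also accept it under the name string):

import re
from bs4 import BeautifulSoup

demo = "<p>read count: 123</p><p>something else</p>"
demo_soup = BeautifulSoup(demo, "html.parser")

# text matches the text content instead of the tags themselves
print(demo_soup.find_all(text=re.compile("read count")))  # ['read count: 123']
print(demo_soup.find("p", text="something else"))         # the matching <p> tag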

# you can also access its value through string
print(soup.h1.string)  # mentioned earlier as well
# access the tag name
print(soup.h1.name)
# access the tag's parent node (the enclosing tag); sibling and child nodes can
# be accessed in the same way
print(soup.h1.parent)
# a tag's attributes can be modified directly: tag[attribute] = "value", e.g.
soup.h1["class"] = "title"
# typical use: modifying an attribute on a batch of tags, etc.
# search by tag name plus class
print("Article read count:", soup.find_all("span", "read-count")[0].get_text())
# find all of the h1-h6 tags
h_list = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
# find_all returns a list, so print it in a loop; you cannot call .get_text()
# on it directly (alternatively, index it with [0] as we did for the read count)
for h in h_list:
    print(h.get_text())

The following two lines are completely equivalent; class is a Python reserved word, so class_ is used:
soup.find_all(class_="recommend-right")
soup.find_all(attrs={"class": "recommend-right"})
# use an html node's contents attribute to get its data; an error is raised
# when the node has no child nodes
for content in soup.head.contents:
    # content has no string attribute or get_text() method
    print(content)

#Child node access
for child in soup.head.children:
    print(child)
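
A short sketch of the related navigation attributes mentioned earlier (parent and siblings), using a tiny made-up document:

from bs4 import BeautifulSoup

demo = "<body><h1>title</h1><p>first</p><p>second</p></body>"
demo_soup = BeautifulSoup(demo, "html.parser")

print(demo_soup.p.parent.name)             # body
print(demo_soup.p.find_next_sibling("p"))  # <p>second</p>
for node in demo_soup.body.descendants:    # children, grandchildren, and so on
    print(node)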

Passing True matches all tags but does not return string nodes, which is convenient for seeing which tags the html page contains

for tag in soup.find_all(True):
    print(tag.name)

You can also combine this with the re library and use regular expression matching to find data; that was explained in the previous article and will not be repeated here.

In addition, there are other search functions available, as well as CSS selector searches (see the small sketch after this list), such as:
find_parents( name , attrs , recursive , text , **kwargs )
find_parent( name , attrs , recursive , text , **kwargs )
find_all_next( name , attrs , recursive , text , **kwargs )
find_next( name , attrs , recursive , text , **kwargs )
...
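
For the CSS-style searches, bs4 provides select() and select_one(); a small made-up example:

from bs4 import BeautifulSoup

demo = "<div class='recommend-right'><a href='#1'>x</a><a href='#2'>y</a></div>"
demo_soup = BeautifulSoup(demo, "html.parser")

print(demo_soup.select("div.recommend-right"))      # tag plus CSS class
print(demo_soup.select("div.recommend-right > a"))  # direct <a> children
print(demo_soup.select_one("a[href='#2']"))         # first match only, like find()
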
The layout of this article is a bit messy; it will be reorganized later (funny)

This blog post is original and was written based on what the author learned from the book "Python Network Data Collection" and from the official documentation.

You are free to modify and republish this blog post; you can even say that you wrote it yourself.

The next article is planned to explain how to crawl a large amount of page information and how to deal with a simple anti-crawler mechanism.
