Python Crawler: Summary of HTML Web Page Parsing Methods

To understand how Python parses web pages, you first need to understand what a web page parser is.

Simply put, it is a tool for parsing HTML pages. More precisely, it is an information-extraction tool for HTML: it parses an HTML page and pulls out either "the valuable data we need" or "new URL links".

Parsing HTML:

  • HTML is hierarchical data
  • There are multiple third-party libraries for parsing HTML, such as lxml, BeautifulSoup, HTMLParser, etc.
  • The problem with parsing HTML: there is no unified standard, and many web pages do not strictly follow the HTML specification

We know that the principle of a crawler is simply to download the content of the target website into memory. At that point the content is just a pile of HTML, which you then parse to extract the data you want.
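
For example, the download step can be as small as the sketch below (the URL and User-Agent here are placeholders, not from this article):

import requests

url = 'https://example.com/'             # placeholder: the page you want to crawl
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder User-Agent

response = requests.get(url, headers=headers, timeout=3.0)
html = response.text   # the raw HTML string, ready to be handed to a parser
print(html[:200])      # preview the first 200 characters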

Today I mainly talk about four methods of parsing HTML content of web pages in Python:

  • BeautifulSoup
  • lxml XPath
  • requests-html
  • regular expression

Among them, BeautifulSoup and XPath are the two libraries most commonly used in Python to parse web pages, and both are powerful tools for novices. Complete beginners are advised to build a solid Python foundation before getting started with crawlers.

**" How to learn Python with zero foundation "** I saw a good article on csdn, which is very practical. You can take a look at it, and share the link below.

Any suggestions for learning Python with zero foundation?

1. BeautifulSoup

The famous BeautifulSoup library is the heavyweight among Python's HTML parsing libraries.

Installation:

pip install beautifulsoup4

The first step in parsing is to build a BeautifulSoup object.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')

The second argument specifies the parser; BeautifulSoup supports the following parsers:

[Image: table of parsers supported by BeautifulSoup]
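
As a quick illustration (the parser names below are the commonly used ones, not taken from the original table; lxml and html5lib must be installed separately):

from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"

soup_std = BeautifulSoup(html_doc, 'html.parser')   # built-in parser, no extra dependency
soup_lxml = BeautifulSoup(html_doc, 'lxml')         # fast C-based parser, pip install lxml
soup_html5 = BeautifulSoup(html_doc, 'html5lib')    # most lenient parser, pip install html5lib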

A BeautifulSoup object corresponds to the entire content of an HTML/XML document.


Basic elements of the BeautifulSoup class:

  • Tag: a pair of opening and closing tags and everything in between; any tag present in the HTML can be accessed as soup.<tag>, and when the document contains several tags with the same name, soup.<tag> returns the first one
  • Name: every Tag has a name, obtained through .name, of string type
  • Attributes: a Tag can have zero or more attributes, stored as a dictionary in .attrs
  • NavigableString: the non-attribute string inside a tag, obtained through .string; it can span multiple levels of nested tags

1) Accessing tags

Through the dot operator, you can directly access specific tags in the document, for example:

>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.head.title
<title>The Dormouse's story</title>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

This method only returns the first matching tag in the document; for multiple tags, use the find_all method, which returns a list.

>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all('a')[0]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

You can also add filter conditions to find_all to locate elements more precisely.

# Filter by text
>>> soup.find_all('a', text='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# Filter by attribute and value
>>> soup.find_all('a', attrs={'id':'link1'})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# Shorthand for the above; only works for some attributes
>>> soup.find_all('a', id='link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# Filter with a CSS class
# note the underscore after class
>>> soup.find_all('p', class_='title')
[<p class="title"><b>The Dormouse's story</b></p>]

2) Accessing tag content and attributes

The tag's name and text can be accessed through name and string, and the attributes and their values can be accessed through get or the bracket operator.

>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.a['class']
['sister']
>>> soup.a.get('class')
['sister']
>>> soup.a.name
'a'
>>> soup.a.string
'Elsie'

Combining element location with attribute access makes it quick and convenient to extract exactly the elements you want.
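
For example, a small sketch with the same soup object pulls every link's text and href in one pass:

# Collect (text, href) pairs for every <a> tag in the document
>>> [(a.string, a.get('href')) for a in soup.find_all('a')]
[('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie'), ('Tillie', 'http://example.com/tillie')]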

Use the Beautiful Soup library to parse web pages

 import requests
 import chardet
 from bs4 import BeautifulSoup

 url = 'https://example.com/'   # placeholder: replace with the page you want to parse
 ua = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) Chrome/65.0.3325.181'}
 rqg = requests.get(url, headers=ua, timeout=3.0)
 rqg.encoding = chardet.detect(rqg.content)['encoding']  # detect the response encoding
 # initialize the HTML
 html = rqg.content.decode(rqg.encoding)
 soup = BeautifulSoup(html, 'lxml')   # build the BeautifulSoup object
 print('Pretty-printed BeautifulSoup object:', soup.prettify())

 print('All child nodes named title:', soup.find_all("title"))

 print('Text of the title node:', soup.title.string)
 print('Text obtained with get_text():', soup.title.get_text())

 target = soup.find_all('ul', class_='menu')   # exact match on the CSS class name
 target = soup.find_all(id='menu')             # keyword argument id, find matching nodes
 target = soup.ul.find_all('a')                # all nodes named a under the first ul

With BeautifulSoup, the request and the parsing still have to be kept as separate steps. In terms of code clarity that is fine, but the code gets a bit verbose for complex parsing. Overall it is pleasant enough to use; it comes down to personal preference.

The two most important basic skills of a crawler are: how do you grab the data, and how do you parse it? We have to learn them and apply them flexibly, using the most effective tool for the situation to accomplish the goal.

The tool itself is secondary; don't get your priorities backwards when learning, and stay clear about your ultimate goal.


2. lxml and XPath

The lxml library can parse both HTML and XML, supports the XPath query syntax, and parses quite efficiently, but you need to be familiar with its rule syntax to use it.

To use XPath, import the etree module from the lxml library and use etree.HTML to initialize the HTML document that needs to be matched.

 import requests
 import chardet
 from lxml import etree

 url = 'https://example.com/'   # placeholder: replace with the page you want to parse
 ua = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) Chrome/65.0.3325.181'}
 rqg = requests.get(url, headers=ua, timeout=3.0)
 rqg.encoding = chardet.detect(rqg.content)['encoding']  # detect the response encoding
 # initialize the HTML
 html = rqg.content.decode(rqg.encoding)
 html = etree.HTML(html, parser=etree.HTMLParser())   # build the element tree from the decoded string

Installation:

pip install lxml

XPath Common Expressions

[Image: table of common XPath expressions]

Use expressions to target head and title nodes

 result = html.xpath('head')               # locate the head node by name
 result1 = html.xpath('/html/head/title')  # locate the title node by its full path
 result2 = html.xpath('//title')           # another way to locate the title node

Commonly used expressions in XPath predicates

[Image: table of common expressions used in XPath predicates]

Use predicates to locate the header and ul nodes

 result1 = html.xpath('//header[@class]')   # locate header nodes that have a class attribute
 result2 = html.xpath('//ul[@id="menu"]')    # locate the ul node whose id is "menu"

Locate and get the text content in the title node

title = html.xpath('//title/text()')

Extract all the text and link addresses under the ul node (print them with a for loop, as shown in the sketch after the code):

 connect = html.xpath('//ul[starts-with(@id,"me")]/li//a/text()')    # list of link texts
 url_list = html.xpath('//ul[starts-with(@id,"me")]/li//a/@href')    # list of link addresses
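
Both calls return plain Python lists, so a simple loop prints the paired results (a sketch; it assumes the two lists line up one-to-one):

 # Pair each link text with its href and print them
 for text, href in zip(connect, url_list):
     print(text, href)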

XPath's syntax is a little complicated, but once you are familiar with it, it is also an excellent way to parse.

[Image: common XPath syntax rules]

Example:

import requests
from lxml import etree

url = "https://movie.douban.com/"
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"


with requests.request('GET', url, headers={'User-agent': ua}) as res:
    content = res.text          # get the HTML content
    html = etree.HTML(content)  # parse the HTML and return the DOM root node
    # path = //div[@class='billboard-bd']//td//a/text()
    orders = html.xpath("//div[@class='billboard-bd']//tr/td[@class='order']/text()")
    titles = html.xpath("//div[@class='billboard-bd']//td//a/text()")  # xpath() returns a list of text nodes
    print(orders)
    print(titles)
    ans = dict(zip(orders, titles))  # Douban movies: this week's ranking
    for k, v in ans.items():
        print(k, v)

3. requests-html

We know that requests is only responsible for the network request and does not parse the response, so requests-html can be understood as a requests library that can also parse HTML documents.

requests-html itself contains very little code; it is a secondary wrapper over existing frameworks that makes them more convenient for developers to call. It depends on libraries such as PyQuery, requests, and lxml.

Installation:

pip install requests-html

Note that this library currently only supports Python 3.6 and above;

requests-html has the following properties:

  • Full support for JavaScript
  • CSS selectors
  • XPath selector
  • Simulates a user agent (like a real web browser)
  • Automatically follow redirects
  • Connection pool and cookie persistence

requests-html keeps the session alive between requests by default, and the object it returns comes with a rich set of methods.
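
For instance, the response's html attribute already exposes parsed results such as the links on the page; a minimal sketch (the URL is a placeholder):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/')   # placeholder URL
print(r.html.links)                       # set of links found on the page
print(r.html.absolute_links)              # the same links resolved to absolute URLs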

Get a random User-Agent

With this there is no need to copy a User-Agent into the request headers every time:

import requests_html

# Automatically generate a user agent, Chrome-style by default
user_agent = requests_html.user_agent()

Support for JavaScript is the biggest highlight of requests-html and relies on the render function. Note that the first time you call this method it downloads Chromium and then uses Chromium to execute the code; the download may require a proxy in some regions, which we won't go into here.
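
A minimal sketch of rendering a JavaScript-driven page (again with a placeholder URL; the first call triggers the Chromium download):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/')   # placeholder URL
r.html.render()                 # execute the page's JavaScript in Chromium
print(r.html.html[:200])        # the rendered HTML source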

Anyone who has learned the requests library will find the requests-html API familiar; the usage is basically the same. The difference is that with requests you must crawl the web page first and then hand it to an HTML parsing library such as BeautifulSoup, whereas here it can be parsed directly.

Example:

from requests_html import HTMLSession

session = HTMLSession()

def parse():
    r = session.get('http://www.qdaily.com/')
    # get the tag, image, title, and publish time of each news item on the home page
    for x in r.html.find('.packery-item'):
        yield {
            'tag': x.find('.category')[0].text,
            'image': x.find('.lazyload')[0].attrs['data-src'],
            'title': x.find('.smart-dotdotdot')[0].text if x.find('.smart-dotdotdot') else x.find('.smart-lines')[0].text,
            'addtime': x.find('.smart-date')[0].attrs['data-origindate'][:-6]
        }

With a few short lines of code, you can grab the articles on the entire homepage.

Several methods used in the example:

① find() can receive two parameters:

The first parameter is a CSS selector, such as a class name or an ID; when the keyword parameter first=True is passed, only the first match is returned.

② text: gets the text content of the element

③ attrs: gets the element's attributes, returned as a dictionary

④ html: gets the HTML content of the element
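
Since requests-html also supports XPath selectors (see the feature list above), the same kind of extraction can be written with the xpath method; a small sketch reusing the HTMLSession from the example above, with a placeholder URL:

# XPath selectors work alongside CSS selectors in requests-html
r = session.get('https://example.com/')   # placeholder URL
hrefs = r.html.xpath('//a/@href')         # every link's href attribute
print(hrefs)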

The advantage of parsing with requests-html is how much the author has encapsulated: even the encoding conversion of the response content is handled automatically, so the code logic stays simpler and more direct and you can focus on the parsing itself.

4. Regular expressions

Regular expressions are usually used to retrieve and replace text that matches a certain pattern, so we can use this principle to extract the information we want.

To use regular expressions, you need to import the re module, which provides full regular expression functionality for Python.

Example of exact string matching:

  • Lookup

import re
example_obj = "1. A small sentence. - 2. Another tiny sentence. "
re.findall('sentence', example_obj)   # first argument: the pattern to find; second argument: the string to search
re.search('sentence', example_obj)
re.sub('sentence', 'SENTENCE', example_obj)
re.match('.*sentence', example_obj)

Note: regular expression support in Python is provided by the built-in re module.

import re
string = "1. A small sentence. - 2. Another tiny sentence."

  • findall(): generally the most used method

re.findall('sentence', string)   # extract every match
>>> ['sentence', 'sentence']

  • search()

re.search('sentence', string)   # returns only the first match (stops as soon as one is found), so it can be faster
>>> <_sre.SRE_Match object; span=(11, 19), match='sentence'>

  • match()

re.match('sentence', string)   # only returns a match if the pattern matches at the very start of the string

  • Substitution: sub(pattern, repl, string)
  • Deletion: call sub() with an empty string '' as repl (see the sketch below)
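
A quick sketch of those last two bullets, using the same string:

# Substitution: replace every match of the pattern with repl
print(re.sub('sentence', 'SENTENCE', string))
# -> 1. A small SENTENCE. - 2. Another tiny SENTENCE.

# Deletion: an empty repl removes every match
print(re.sub('tiny ', '', string))
# -> 1. A small sentence. - 2. Another sentence.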

Use regular expressions to find title content in web page content:

 import requests
 import chardet
 import re

 url = 'https://example.com/'   # placeholder: replace with the page you want to parse
 ua = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) Chrome/65.0.3325.181'}
 rqg = requests.get(url, headers=ua, timeout=3.0)
 rqg.encoding = chardet.detect(rqg.content)['encoding']  # detect the response encoding

 title_pattern = r'(?<=<title>).*?(?=</title>)'
 title_com = re.compile(title_pattern, re.M|re.S)   # compile the pattern
 title_search = re.search(title_com, rqg.text)      # apply the regular expression
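
re.search returns a match object, or None if no title was found; a small usage sketch:

 # Print the captured title text if the pattern matched
 if title_search:
     print(title_search.group())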

It is not that regular expressions cannot locate a specific node and extract the links and text inside it, but XPath and Beautiful Soup make that job far more convenient.

Regular expressions are cumbersome to write and not easy to read, although they match very efficiently. Now that there are so many ready-made HTML parsing libraries, hand-matching content with regular expressions is not recommended; it is time-consuming and labor-intensive.

5. Summary

(1) Regular-expression matching is not recommended, because there are already many ready-made libraries that can be used directly; there is no need to define large numbers of regular expressions, and they cannot be reused. Anyone who has tried filtering web page content with regular expressions knows how strenuous it is, and the results never feel quite right.

(2) BeautifulSoup is a DOM-based approach. Simply put, it loads the entire page into a DOM tree while parsing, so the memory overhead and time cost are relatively high; it is not recommended for processing massive amounts of content.

On the other hand, BeautifulSoup does not require a clearly structured page, because it can find the tags we want directly, which makes it well suited to pages whose HTML structure is not clear.

(3) XPath parsing is based on a SAX-style mechanism: it does not load the entire content into a DOM the way BeautifulSoup does, but parses the content in an event-driven way, which is more lightweight.

However, XPath requires a clear web page structure, and it is more difficult to develop than DOM parsing. It is recommended to use it when parsing efficiency is required.

(4) requests-html is a relatively new library, which is highly encapsulated and has clear source code. It directly integrates a large number of tedious and complicated operations during parsing, and supports both DOM parsing and XPath parsing. It is flexible and convenient, and you can try it.

Besides the parsing methods introduced above, there are many others, which won't be covered one by one here.
