Using Python and XPath to extract Maoyan movies

Preface

The text and pictures in this article are from the Internet and are for learning and communication purposes only. They do not have any commercial use. If you have any questions, please contact us for processing.


Use XPath to extract the top 100 Maoyan movies: https://maoyan.com/board/4

XPath data extraction

XML introduction

XML stands for Extensible Markup Language. It is an important tool for transmitting data over the Internet: it works across platforms and is not tied to any particular programming language or operating system, which makes it a widely supported data carrier. It is very similar to HTML.

The difference between HTML and XML is that HTML is mainly used to display data, and XML is used to transmit data.

XML elements are delimited by tags, and the tags appear in pairs: an opening tag and a matching closing tag.

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

  <book category="奇幻">
    <title lang="ch">冰与火之歌</title>
    <author>乔治 马丁</author>
    <year>2005</year>
    <price>365.00</price>
  </book>

  <book category="童话">
    <title lang="ch">哈利波特与死亡圣器</title>
    <author>J K. 罗琳</author>
    <year>2005</year>
    <price>48.98</price>
  </book>

  <book category="编程">
    <title lang="ch">Python编程-从入门到放弃</title>
    <author>挖掘机小王子</author>
    <year>2048</year>
    <price>99.00</price>
  </book>

  <book category="web" cover="paperback">
    <title lang="en">Python编程-从看懂到看开</title>
    <author>尼古拉斯-赵四</author>
    <year>2003</year>
    <price>39.95</price>
  </book>

</bookstore>

In the XML above, the elements are related to one another as parent and child, ancestor and descendant, and so on.
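These relationships can be made concrete with Python's standard xml.etree.ElementTree module. The following is a minimal sketch on a shortened stand-in for the bookstore document; the English placeholder values are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the bookstore document (values are illustrative only)
XML_TEXT = """
<bookstore>
  <book category="fantasy">
    <title lang="en">Book A</title>
    <price>365.00</price>
  </book>
  <book category="programming">
    <title lang="en">Book B</title>
    <price>99.00</price>
  </book>
</bookstore>
""".strip()

root = ET.fromstring(XML_TEXT)

# <bookstore> is the parent of each <book>; each <book> is one of its children
children_of_root = [child.tag for child in root]

# Every element below <bookstore>, at any depth, is one of its descendants
descendants_of_root = [elem.tag for elem in root.iter() if elem is not root]

# <title> and <price> under the same <book> are siblings
siblings_in_first_book = [child.tag for child in root[0]]

print(children_of_root)
print(descendants_of_root)
print(siblings_in_first_book)
```

Iterating over an element yields only its direct children, while iter() walks the whole subtree in document order, which is the distinction between child and descendant.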

XPath introduction

XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in XML/HTML documents and extract corresponding elements.

It is also a data extraction method, but it applies only to HTML/XML data. That suits crawlers well, because they mainly deal with HTML pages.

XPath matching rules

The following table lists the rules commonly used in XPath:

[Table image: commonly used XPath rules]

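The most basic XPath forms can be tried out with lxml. The sketch below covers the standard path operators ( / , // , . , .. , @ ) and a predicate; it is not a reconstruction of the original table. The sample document is a shortened stand-in for the bookstore XML with placeholder values.

```python
from lxml import etree

# Shortened stand-in for the bookstore document (values are illustrative only)
XML_TEXT = """
<bookstore>
  <book category="fantasy">
    <title lang="en">Book A</title>
    <price>365.00</price>
  </book>
  <book category="web" cover="paperback">
    <title lang="en">Book B</title>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(XML_TEXT)

# /  : absolute path that starts at the document root
titles_abs = root.xpath("/bookstore/book/title/text()")
# // : match nodes anywhere in the document, regardless of depth
prices = root.xpath("//price/text()")
# @  : select an attribute value
categories = root.xpath("//book/@category")
# [] : predicate that filters nodes, here by attribute value
web_titles = root.xpath("//book[@category='web']/title/text()")
# .. : step up to the parent of the matched node
parent_tags = [e.tag for e in root.xpath("//title[text()='Book B']/..")]
# .  : the current node; "./title" looks only at its direct children
first_book = root.xpath("//book[1]")[0]
first_book_titles = first_book.xpath("./title/text()")

print(titles_abs, prices, categories)
print(web_titles, parent_tags, first_book_titles)
```

Note that XPath positions start at 1, so //book[1] is the first book, not the second.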
lxml library

lxml is a third-party Python module whose main job is parsing HTML/XML and extracting data from it.

Like regular expressions, lxml is a tool for extracting data. It is a high-performance Python HTML/XML parser, and with it we can use the XPath syntax introduced earlier to quickly locate specific elements and node information.

  • Installation: pip install lxml

If online installation is unsuccessful, use offline installation.

If you install parsel, lxml is installed automatically as a dependency, so there is no need to install it separately.

Use lxml module

Initializing with etree.HTML() produces an object that supports XPath queries, and at the same time automatically completes incomplete HTML tags. Pass in the page source as the argument.

from lxml import etree

string = """
  <book category="web" cover="paperback">
    <title lang="en">Python编程-从看懂到看开</title>
    <author>Python编程</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
"""

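# The string must be parsed into an element tree before it can be queried
html = etree.HTML(string)
# xpath() always returns a list
result = html.xpath("//book[contains(@cover,'paper')]/title/text()")
print(result)

# Select by position (XPath positions start at 1; this snippet has only one <book>)
result = html.xpath("//book[1]/title/text()")
print(result)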
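The tag completion performed by etree.HTML() can be observed by serializing the parsed tree back to text. The following is a small sketch; the broken fragment is a made-up example, and the exact serialized output may vary slightly between lxml versions.

```python
from lxml import etree

# Deliberately incomplete fragment: no <html>/<body> wrapper,
# and the last <li> and the <ul> are never closed
broken = "<ul><li>first</li><li>second"

tree = etree.HTML(broken)

# Serialize the repaired tree to see which tags the parser added
repaired = etree.tostring(tree, encoding="unicode")
print(repaired)

# The repaired tree can be queried with XPath as usual
items = tree.xpath("//li/text()")
print(items)
```

The parser wraps the fragment in html and body elements and closes the dangling tags, which is why XPath queries work even on sloppy real-world pages.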

Because parsel is built on top of lxml, you can switch between the two seamlessly and use XPath directly in parsel.

Use the xpath() method to select the content you want. The XPath expression goes inside the parentheses, and the result comes back as a list.

# -*- coding: utf-8 -*-
import requests
import parsel

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}

response = requests.get('https://maoyan.com/board/4?offset=0', headers=headers)
html = response.text

# %% Select arbitrary nodes
sel = parsel.Selector(html)
# Extract the p tags
ps = sel.xpath('//p')
for p in ps:
    print(p.get())

Case: extracting Maoyan movies with XPath

'''
Maoyan movies:
https://maoyan.com/board/4?offset=20
    Function-based structure:
        1. A function that requests one page
        2. A function that parses one page
        3. A function that writes to a file
        4. A loop that drives the functions
'''
import json
import requests
from lxml import etree


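# Fetch the response
def getOnePage(url):
    '''Request one page and return its HTML text.'''

    # Assumption: send a browser-like User-Agent, as in the parsel example,
    # since the site may reject requests that lack one
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=headers)

    return response.text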


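# Parse the response --> results
def parseOnePage(text):

    # Build the element tree
    html = etree.HTML(text)
    # Select the outer container that holds all the data first, then loop over it
    data = html.xpath('//dl[@class="board-wrapper"]')
    # Iterate and extract every element
    for dat in data:
        # Keep selecting, relative to the current node
        # Title
        title = dat.xpath('.//div//a/text()')
        # Starring
        star = dat.xpath('.//p[@class="star"]/text()')
        # Release time (the leading "." keeps the query relative to dat)
        releasetime = dat.xpath('.//p[@class="releasetime"]/text()')

        for tit, sta, rel in zip(title, star, releasetime):
            # A return would end the function at the first item;
            # yield makes this function a generator instead
            yield {
                '电影名字': tit,        # movie title
                '主演': sta.strip(),    # starring
                '上映时间': rel         # release time
            }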


def save2File(data):
    # with open('maoyan66.txt', 'a', encoding='utf-8') as fp:
    #     fp.write(data+'\n')

    with open('maoyan66.txt', 'a', encoding='utf-8') as fp:
        fp.write(json.dumps(data, ensure_ascii=False)+'\n')


if __name__ == "__main__":

    for page in range(10):

        # URL of one page
        url = f'https://maoyan.com/board/4?offset={page*10}'
        # Request the page
        r = getOnePage(url)
        # Parse the data; this returns a generator
        result = parseOnePage(r)
        for res in result:
            # Pass the dict itself so json.dumps writes one JSON object per line
            save2File(res)
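One detail in the parsing step is easy to miss: a path that starts with // is evaluated against the whole document even when xpath() is called on a sub-element, while .// stays inside the current node. The sketch below shows the difference on a hypothetical fragment that only imitates the shape of a ranking list; the markup and values are assumptions, not real Maoyan source.

```python
from lxml import etree

# Hypothetical fragment shaped like a ranking list; not real Maoyan markup
HTML = """
<dl class="board-wrapper">
  <dd>
    <p class="name"><a>Movie A</a></p>
    <p class="releasetime">1993</p>
  </dd>
  <dd>
    <p class="name"><a>Movie B</a></p>
    <p class="releasetime">1994</p>
  </dd>
</dl>
"""

tree = etree.HTML(HTML)
rows = tree.xpath('//dl[@class="board-wrapper"]/dd')

# "//..." ignores the context node and searches the whole document
absolute = [row.xpath('//p[@class="releasetime"]/text()') for row in rows]
# ".//..." searches only inside the current <dd>
relative = [row.xpath('.//p[@class="releasetime"]/text()') for row in rows]

# Building one record per row keeps each title paired with its own release time
movies = [
    {
        "title": row.xpath('.//p[@class="name"]/a/text()')[0],
        "releasetime": row.xpath('.//p[@class="releasetime"]/text()')[0],
    }
    for row in rows
]

print(absolute)
print(relative)
print(movies)
```

Iterating over one element per movie, as done here with dd, is an alternative to zipping three separate lists: if a field is missing for one movie, the other records do not shift out of alignment.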

Origin blog.csdn.net/pythonxuexi123/article/details/112787203