Python crawler: ravaging pirated novel sites with lxml's etree class

I won't go into detail about installing Python itself.

For a detailed introduction to lxml, see:

lxml.etree Tutorial by Stefan Behnel

First of all, a reminder: when we crawl a page, we need to make our requests look like they come from a browser, rather than announcing "I am a script, hand over your data", which gets us banned. (Smart people.) And if you use the campus network, a ban may mean nobody in the whole school can reach the site. A mobile-phone hotspot avoids this problem: just turn on airplane mode, wait a few minutes, and the base station will assign a new IP address, so we can keep on sinning hahahahaha!!
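As a minimal sketch of what "looking like a browser" means in practice (the URL and User-Agent string here are just placeholders, not the real target site), we send a User-Agent header with every request:

import requests

# A common desktop browser User-Agent string; any current one works.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

resp = requests.get('https://example.com', headers=headers, timeout=30)
print(resp.status_code)  # 200 means the site served us like a browser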

Here we use the requests library. If the installation succeeds, pip will report "Successfully installed".
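Assuming pip is available on your PATH, the usual install command is:

pip install requests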

The lxml library is installed the same way:
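pip install lxml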

 

After that, we find a Biquge website (for a pirated mirror, see "Biquge_Book Friends' Most Worthy Collection of Online Novels!" at beqege.com).

I'm using the genuine site here (though with these sites it's hard to say).

Find a novel you like and get its URL:

As a beginner, I used an XPath helper tool (a browser extension).

It simplifies the work for us beginners and locates the required text directly.

Here we only need to locate the title and the content of the novel. And since the crawler runs chapter by chapter, we also need to check whether there is a next chapter and, if so, obtain its URL.
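Before the real code, here is a minimal, self-contained sketch of how etree.HTML plus XPath locates exactly those three things. The HTML string is made up for illustration; it is not the real Biquge markup:

from lxml import etree

# A toy chapter page standing in for the real site.
page = etree.HTML(
    '<div id="main">'
    '<h1>Chapter 1</h1>'
    '<div id="content">First line.<br/>Second line.</div>'
    '<a id="link-next" href="/book/2.html">Next</a>'
    '</div>'
)

print(page.xpath("//h1/text()"))                   # ['Chapter 1']
print(page.xpath("//div[@id='content']/text()"))   # ['First line.', 'Second line.']
print(page.xpath("//a[@id='link-next']/@href"))    # ['/book/2.html']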

After that comes the coding phase.

We request the page through the requests library, then select the text we need through lxml's etree class.

The text is then written to a file; as crawling continues, each new chapter is appended to it.

Below is my full code

import sys
import time
from urllib.parse import urljoin

import requests
from lxml import etree

# First chapter of the novel; crawling starts here.
base_url = "https://www.bbiquge.net/book/133312/56524593.html"

# Pretend to be a real browser so the site does not block us.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.79'
}


def main():
    url = base_url
    with open('output.txt', 'w', encoding='utf-8') as file:
        for _ in range(0, 2):  # two chapters here; raise the bound to crawl more
            response = requests.get(url=url, headers=header, timeout=30)
            html = etree.HTML(response.text)
            # Chapter title and chapter body, located with XPath.
            titles = html.xpath("/html/body/div[@id='main']/h1")
            contents = html.xpath("/html/body/div[@id='main']/div[@id='readbox']/div[@id='content']")
            # href of the next chapter, if there is one.
            next_href = html.xpath("/html/body/div[@id='main']/div[@id='readbox']/div[@class='papgbutton']/a[@id='link-next']/@href")

            for title in titles:
                file.write('\n'.join(title.xpath('./text()')))
                file.write('\n')
            for content in contents:
                file.write('\n'.join(content.xpath('./text()')))
                file.write('\n')

            if not next_href:
                break  # no next chapter, so stop instead of requesting None
            # The href may be relative, so resolve it against the current URL.
            url = urljoin(url, next_href[0])
            time.sleep(2)  # be polite between requests
    return 0


if __name__ == "__main__":
    sys.exit(main())

XPath tool compressed file: https://pan.baidu.com/s/14FUbSoM1u8JA1z7uZHX1wg (password: 1kbc)

Decompress it directly into a folder. The Edge browser needs the plug-in loaded locally (in Chrome it can be installed directly from the extension store).

(Just a pure beginner recording his own reasoning; please don't flame me, and if anything is wrong, please correct me.)


Origin blog.csdn.net/date3_3_1kbaicai/article/details/131692008