Write a simple Biquge crawler in Python with 60 lines of code!

Preface

Use Python to write a simple Biquge crawler that, given the URL of a novel's index page, crawls the entire novel and saves each chapter to a txt file. The crawler uses the select method of the BeautifulSoup library; the end result is a folder named after the novel containing one txt file per chapter.

This article is for learning purposes only.

1. Web page analysis

The novel Douluo Mainland is used as the example; its index URL is http://www.biquge001.com/Book/2/2486/

Inspecting the page shows that each chapter's URL and name live in a tags nested under <div id="list"> > <dl> > <dd>, so both can be extracted with BeautifulSoup's select method:

Tag = BeautifulSoup(getHtmlText(url), "html.parser")  # getHtmlText is a helper of my own, defined below, that fetches the HTML
urls = Tag.select("div #list dl dd a")

Then iterate through the list

for url in urls: 
    href = "http://www.biquge001.com/" + url['href']  # concatenate to form the full chapter URL
    pageName = url.text  # the name of each chapter
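Putting the two snippets together, here is a minimal, self-contained sketch that runs on a simplified sample of the page layout (the sample HTML below is an assumption for illustration, not the real markup):

from bs4 import BeautifulSoup

sample = """
<div id="wrapper">
  <div id="list">
    <dl>
      <dd><a href="Book/2/2486/1.html">Chapter 1</a></dd>
      <dd><a href="Book/2/2486/2.html">Chapter 2</a></dd>
    </dl>
  </div>
</div>
"""
Tag = BeautifulSoup(sample, "html.parser")
for url in Tag.select("div #list dl dd a"):
    print("http://www.biquge001.com/" + url['href'], url.text)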

The content of each chapter is then fetched in the same way; it lives in <div id="content">:


substance = Tag.select("div #content") # The content of the article
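A note on the extracted text: .text keeps whatever whitespace the page uses; if the body contains non-breaking spaces (decoded &nbsp; entities, a common pattern on such sites, though the post does not confirm it), a small cleanup pass can be added:

for i in substance:
    # join the text with newlines, trim the edges, and drop decoded &nbsp; characters
    substanceStr = i.get_text("\n", strip=True).replace("\xa0", "")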

Finally, in the same way, get the name of the novel from the homepage; it lives in <div id="info"> > <h1>:

bookName = Tag.select("div #info h1")
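The same sketch idea applies to the title (again with an assumed, simplified structure):

sample = '<div id="wrapper"><div id="info"><h1>Douluo Mainland</h1></div></div>'
Tag = BeautifulSoup(sample, "html.parser")
print(Tag.select("div #info h1")[0].text)  # prints: Douluo Mainland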

2. Writing the code

1. Helper methods: fetch the HTML and write to a file

import os
import time

import requests
from bs4 import BeautifulSoup

# The original post does not show its request headers; a minimal User-Agent is assumed here
headers = {"User-Agent": "Mozilla/5.0"}

def getHtmlText(url):
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding  # switch to the encoding detected from the content
    return r.text

def writeIntoTxt(filename, content):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)
    print(filename + " done")
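A quick sanity check of the two helpers (test.txt is just an arbitrary example file name; headers is the assumed header dict defined above):

html = getHtmlText("http://www.biquge001.com/Book/2/2486/")
writeIntoTxt("test.txt", html)  # saves the raw index page and prints "test.txt done"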

2. The rest of the code

The code is as follows (example):

url = "http://www.biquge001.com/Book/2/2486/"
substanceStr = ""
bookName1 = ""
html = getHtmlText(url)
Tag = BeautifulSoup(html, "html.parser")
urls = Tag.select("div #list dl dd a")
bookName = Tag.select("div #info h1")
for i in bookName:
    bookName1 = i.text
# Create a directory named after the novel if it does not exist yet
if not os.path.exists(bookName1):
    os.mkdir(bookName1)
    print(bookName1 + " created")
else:
    print("Directory already exists")
for url in urls:
    href = "http://www.biquge001.com/" + url['href']  # concatenate to form the full chapter URL
    pageName = url.text  # chapter name
    path = bookName1 + "\\"  # directory path (Windows-style separator)
    fileName = path + pageName + ".txt"  # file name = path + chapter name + ".txt"
    Tag = BeautifulSoup(getHtmlText(href), "html.parser")  # parse each chapter page
    substance = Tag.select("div #content")  # the chapter content
    for i in substance:
        substanceStr = i.text
    writeIntoTxt(fileName, substanceStr)
    time.sleep(1)  # pause briefly between requests

Summary

A simple use of BeautifulSoup's select method to crawl a novel from Biquge.
