Preface
Use Python to write a simple Biquge crawler: given the URL of a novel, it crawls every chapter and saves each one to a txt file. The crawler uses the select method of the BeautifulSoup library; the result is shown in the figure:
This article is for learning about web crawlers only
1. Web page analysis
The Douluo Mainland novel page is used as the example URL: http://www.biquge001.com/Book/2/2486/
Inspecting the page source shows that each chapter's URL and name are placed in the a tags inside <div id="list"> <dl> <dd>, so both can be obtained with the select method of BeautifulSoup
```python
Tag = BeautifulSoup(getHtmlText(url), "html.parser")  # getHtmlText is a helper written by myself to fetch the HTML (defined below)
urls = Tag.select("div#list dl dd a")
```
Then iterate through the list
```python
for url in urls:
    href = "http://www.biquge001.com/" + url['href']  # splice the strings into the full chapter URL
    pageName = url.text  # the name of each chapter
```
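As a self-contained illustration of the two snippets above, the sketch below runs the same kind of select call on a small inline HTML fragment that imitates Biquge's chapter list (the fragment, wrapper class, and chapter names are made up for this example):

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the <div id="list"> structure on a Biquge index page
sample = """
<div class="box_con">
  <div id="list">
    <dl>
      <dd><a href="/Book/2/2486/1.html">Chapter 1</a></dd>
      <dd><a href="/Book/2/2486/2.html">Chapter 2</a></dd>
    </dl>
  </div>
</div>
"""

tag = BeautifulSoup(sample, "html.parser")
chapters = []
for a in tag.select("div#list dl dd a"):
    href = "http://www.biquge001.com" + a["href"]  # splice into a full URL
    chapters.append((href, a.text))
print(chapters)
```

Note that the hrefs in this made-up fragment start with "/", so the base URL here is spliced without a trailing slash.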
The content of each chapter is stored inside <div id="content"> and is fetched in the same way
```python
substance = Tag.select("div#content")  # the content of the article
```
Finally, again in the same way, get the name of the novel from the h1 tag inside <div id="info"> on the homepage
```python
bookName = Tag.select("div#info h1")
```
2. Writing the code
1. Helpers to fetch the HTML and write the file
```python
import requests

# A browser User-Agent is assumed here; any common one works
headers = {"User-Agent": "Mozilla/5.0"}

def getHtmlText(url):
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding  # convert the encoding
    r.raise_for_status()
    return r.text

def writeIntoTxt(filename, content):
    with open(filename, "w", encoding="utf-8") as f:  # the with block closes the file automatically
        f.write(content)
    print(filename + " finished")
```
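To sanity-check the writing helper without touching the site, the sketch below re-uses writeIntoTxt to save a made-up chapter into a temporary directory:

```python
import os
import tempfile

def writeIntoTxt(filename, content):
    # Same helper as above: save the chapter text as UTF-8
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)
    print(filename + " finished")

tmpdir = tempfile.mkdtemp()
fileName = os.path.join(tmpdir, "Chapter 1.txt")  # made-up chapter name
writeIntoTxt(fileName, "Sample chapter text")

# Read the file back to confirm the content round-trips
with open(fileName, encoding="utf-8") as f:
    saved = f.read()
```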
2. The rest of the code
The code is as follows (example):
```python
import os
import time

url = "http://www.biquge001.com/Book/2/2486/"
substanceStr = ""
bookName1 = ""
html = getHtmlText(url)

Tag = BeautifulSoup(html, "html.parser")
urls = Tag.select("div#list dl dd a")
bookName = Tag.select("div#info h1")
for i in bookName:
    bookName1 = i.text

# Determine whether the novel's folder already exists
if not os.path.exists(bookName1):
    os.mkdir(bookName1)
    print(bookName1 + " created")
else:
    print("Folder already exists")

for url in urls:
    href = "http://www.biquge001.com/" + url['href']  # splice the strings into the full chapter URL
    pageName = url.text  # chapter name of each chapter
    path = bookName1 + "\\"  # path
    fileName = path + pageName + ".txt"  # file name = path + chapter name + ".txt"
    Tag = BeautifulSoup(getHtmlText(href), "html.parser")  # parse each chapter page
    substance = Tag.select("div#content")  # the content of the article
    for i in substance:
        substanceStr = i.text
    writeIntoTxt(fileName, substanceStr)
    time.sleep(1)
```
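One design note on the path handling above: concatenating with "\\" only works on Windows. A portable sketch using os.path.join instead (the folder and chapter names here are made up):

```python
import os

bookName1 = "Douluo Mainland"  # example folder name
pageName = "Chapter 1"         # example chapter name

# Portable replacement for: fileName = bookName1 + "\\" + pageName + ".txt"
fileName = os.path.join(bookName1, pageName + ".txt")
print(fileName)
```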
Summary
A simple use of BeautifulSoup's select method to crawl novel pages from Biquge