Python programming examples | Crawling novels from the Internet

 

Internet literature has become an important part of China's popular culture in the new century, and online novels are widely loved by young readers. This article takes crawling the text of an online novel as an example and walks through writing a simple, practical crawler script.

01. Analyze web pages

Many people like to read online novels locally, that is, they download the novels to their phones or other mobile devices. This way they are not restricted by the network and can use a reading app to adjust the display style to their liking. Unfortunately, major websites rarely provide a download function for an entire novel, and only some sites allow VIP members to download multiple chapters. For ordinary readers, VIP chapters must be purchased, but they would at least like to read the many free chapters in one go. A crawler program can help them download all the free chapters of a novel into a TXT file for convenient reading on other devices (we also remind everyone to support legitimate copies, stay away from piracy, and strengthen intellectual property awareness).

Taking Zhulang Novel Network as an example, we select a popular novel (or any novel of interest) from the ranking list for analysis. First there is the novel's homepage, which carries various information (such as the novel's introduction, the latest chapters, reader comments, etc.); next there is a chapter list page (some websites call it the "latest chapters" page); and each chapter of the novel has its own separate page. Clearly, if we can use the chapter list page to collect the URLs of all chapters, then we only need a program to fetch the content of each chapter and write it into a local TXT file to complete the capture of the novel.

After viewing a chapter page, we unfortunately find that the chapter content is loaded with JavaScript, and the whole page makes heavy use of CSS and JS-generated effects, which makes crawling a little harder. Directly requesting a chapter page's URL with requests or the urllib library is unrealistic, but Selenium solves this problem easily. For a small-scale task, the cost in performance and time is acceptable.

Next, let's analyze how to locate the text element. Viewing the element in developer mode (see Figure 1), we find that the id value read-content can be used to locate the text. The class value is also read-content, so in theory the class name could be used for locating as well; however, Selenium does not directly support locating compound class names, so the idea of locating by class has to be set aside for now.

■ Figure 1 Novel chapter content in developer mode

Tip /

Although Selenium only supports direct locating of simple class names, you can use a CSS selector to locate a compound class name. If you are interested, look into the find_element_by_css_selector() method in Selenium.
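For instance, a minimal sketch along these lines (assuming a Selenium 3-style API, a driver that has already opened a chapter page, and a purely hypothetical second class name in the commented-out line) could be:

# Locate the chapter text by class name via a CSS selector instead of
# find_element_by_class_name(), which cannot handle compound class names.
content = driver.find_element_by_css_selector('div.read-content')
# If the class attribute contains several names, chain them with dots, e.g.:
# content = driver.find_element_by_css_selector('div.read-content.some-other-class')  # hypothetical
print(content.text)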

02. Write a crawler

Use Selenium with Chrome for this crawl. In addition to installing Selenium with pip, you first need to install ChromeDriver. You can visit the following address to download it locally:

https://sites.google.com/a/chromium.org/chromedriver/downloads

After entering the download page (see Figure 2), download according to the version of your system.

■ Figure 2 ChromeDriver download page

Afterwards, use the selenium.webdriver.Chrome(path_of_chromedriver) statement to create a Chrome browser object, where path_of_chromedriver is the path to the downloaded ChromeDriver.
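A minimal sanity check might look like the following sketch (the driver path is a placeholder; adjust it to wherever you saved ChromeDriver):

from selenium import webdriver

path_of_chromedriver = '/path/to/chromedriver'   # placeholder path, replace with your own
driver = webdriver.Chrome(path_of_chromedriver)  # Selenium 3-style positional argument
driver.get('http://book.zhulang.com/')           # open the site's homepage as a quick test
print(driver.title)
driver.quit()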

In the script, define a crawler class named NovelSpider, initialized with the URL of the novel's "all chapters" page (similar to a constructor). It also has a list attribute that will store the URL of each chapter. The class methods are as follows.

  • get_page_urls(): Grab the URL of each chapter from all chapter pages.

  • get_novel_name(): Get the title of the current novel from all chapter pages.

  • text_to_txt(): Save the text content in each chapter to a TXT file.

  • looping_crawl(): Loop crawling.

After sorting out the ideas, you can start writing the program. The final crawler code is shown in Example 1.

[Example 1] Crawling program for online novels.

# NovelSpider.py
import re
import time
import selenium.webdriver
from selenium.common.exceptions import WebDriverException


class NovelSpider():
    def __init__(self, url):
        self.homepage = url  # URL of the "all chapters" page
        self.driver = selenium.webdriver.Chrome(path_of_chromedriver)
        self.page_list = []  # will hold the URL of every chapter

    def __del__(self):
        self.driver.quit()

    def get_page_urls(self):
        # Collect chapter URLs from the "all chapters" page
        homepage = self.homepage
        self.driver.get(homepage)
        self.driver.save_screenshot('screenshot.png')
        self.driver.implicitly_wait(5)
        elements = self.driver.find_elements_by_tag_name('a')
        for one in elements:
            page_url = one.get_attribute('href')
            pattern = r'http:\/\/book\.zhulang\.com\/\d{6}\/\d+\.html'
            if page_url and re.match(pattern, page_url):
                print(page_url)
                self.page_list.append(page_url)

    def looping_crawl(self):
        homepage = self.homepage
        filename = self.get_novel_name(homepage) + '.txt'
        self.get_page_urls()
        pages = self.page_list
        # print(pages)
        for page in pages:
            self.driver.get(page)
            print('Next page:')
            self.driver.implicitly_wait(3)
            title = self.driver.find_element_by_tag_name('h2').text
            res = self.driver.find_element_by_id('read-content')
            text = '\n' + title + '\n'
            for one in res.find_elements_by_xpath('./p'):
                text += one.text
                text += '\n'
            self.text_to_txt(text, filename)
            time.sleep(1)
            print(page + '\t\t\tis Done!')

    def get_novel_name(self, homepage):
        # Read the book title from the "all chapters" page
        self.driver.get(homepage)
        self.driver.implicitly_wait(2)
        res = self.driver.find_element_by_tag_name('strong').find_element_by_xpath('./a')
        if res is not None and len(res.text) > 0:
            return res.text
        else:
            return 'novel'

    def text_to_txt(self, text, filename):
        if filename[-4:] != '.txt':
            print('Error, incorrect filename')
        else:
            with open(filename, 'a') as fp:
                fp.write(text)
                fp.write('\n')


if __name__ == '__main__':
    hp_url = input('Enter the URL of the novel\'s "all chapters" page: ')
    path_of_chromedriver = 'your path of chromedriver'
    try:
        sp1 = NovelSpider(hp_url)
        sp1.looping_crawl()
        del sp1
    except WebDriverException as e:
        print(e.msg)

The __init__() and __del__() methods can be regarded as the constructor and destructor, executed when the object is created and destroyed respectively. __init__() stores the URL string and creates the browser object, and __del__() quits the Selenium browser. The try-except statement runs the main part and catches the WebDriverException exception (the most common exception type when running Selenium). The looping_crawl() method calls each of the other methods described above in turn.

The driver.save_screenshot() method is the method in selenium.webdriver to save a screenshot of the current window of the browser.

The driver.implicitly_wait() method sets Selenium's implicit wait, a maximum waiting time. If the element being looked up appears within the specified time, the next step is executed immediately; otherwise the driver keeps waiting until the timeout expires before moving on.

Tip /

An explicit wait pauses until a certain condition is triggered before proceeding to the next step, and can be combined with ExpectedCondition to support all kinds of custom judgment conditions. An implicit wait takes only one line to write, which is convenient, but its scope is the entire life cycle of the WebDriver instance, so it can slow down an application that responds normally and lengthen total execution time.
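As a hedged sketch, an explicit wait for the chapter container of this example (assuming Selenium 3-style imports and an existing driver) might be written as:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait at most 10 seconds for the element with id "read-content" to appear,
# then continue immediately once it is present.
content = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'read-content'))
)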

driver.find_elements_by_tag_name() is one of the many methods used by Selenium to locate elements. All methods for locating a single element are as follows.

  • find_element_by_id(): Locate according to the id attribute of the element and return the first element matching the id attribute; if no element matches, a NoSuchElementException exception will be thrown.

  • find_element_by_name(): Locate according to the name attribute of the element and return the first element whose name attribute matches; if no element matches, a NoSuchElementException exception is thrown.

  • find_element_by_xpath(): Position based on XPath expression.

  • find_element_by_link_text(): Locate hyperlinks by link text. There is also a substring matching version of this method find_element_by_partial_link_text().

  • find_element_by_tag_name(): Use HTML tag name to locate.

  • find_element_by_class_name(): Use class positioning.

  • find_element_by_css_selector(): Positioning based on CSS selector.

The methods for finding multiple elements simply change element to the plural elements and return the search results as a list; otherwise they work the same as the methods above. After locating an element, you can use its text property or the get_attribute() method to obtain its text or an individual attribute.
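For example, a small sketch of collecting candidate links from the "all chapters" page (assuming an existing driver and the Selenium 3-style methods listed above) might be:

# find_elements_by_tag_name() returns a list of matching elements;
# each element exposes the .text property and the get_attribute() method.
links = driver.find_elements_by_tag_name('a')
for link in links:
    href = link.get_attribute('href')  # attribute value, e.g. a chapter URL (may be None)
    label = link.text                  # visible link text
    print(label, href)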

page_url = one.get_attribute('href')

This line of code uses the get_attribute() method to obtain the URL address of each chapter link that was located. The program also uses the re.match() method of re (Python's regular expression module) to match page_url against a regular expression of the form:

'http"\/\/book\.zhulang\.com\/\d{6}\/\d+\.html'

Such a regular expression matches the following string:

http://book.zhulang.com/A/B.html

Here, part A must be exactly 6 digits and part B must be one or more digits. This happens to be the URL form of each chapter page of the novel, so only URL links matching this form are added to page_list.
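A quick sketch of this filter (the sample URLs below are invented for illustration) behaves as follows:

import re

pattern = r'http:\/\/book\.zhulang\.com\/\d{6}\/\d+\.html'
print(re.match(pattern, 'http://book.zhulang.com/123456/789.html'))  # Match object: accepted
print(re.match(pattern, 'http://book.zhulang.com/news/789.html'))    # None: rejected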

The commonly used functions of the re module are as follows; a short illustrative sketch follows the list.

  • compile(): Compile the regular expression and generate a Pattern object. Then you can use a series of methods of Pattern to match/search the text (of course, the matching/search function also supports taking Pattern expressions directly as parameters).

  • match(): used to match at the head of a string (a starting position can also be specified); it returns as soon as a matching result is found.

  • search(): Used to search anywhere in the string and return as long as a matching result is found.

  • findall(): Returns all matching substrings in list form. If there is no match, returns an empty list.

  • finditer(): Search the entire string and get all matching results. One big difference from findall() is that it returns an iterator that sequentially accesses each matching result (Match object).

  • split(): Split the string according to matching substrings and return a result list.

  • sub(): used for replacement, replacing the matched part of the parent string with a specific string.
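
The following toy sketch (the strings are made up, not taken from the site) illustrates the functions listed above:

import re

text = 'chapter 1, chapter 2, chapter 10'
pat = re.compile(r'chapter (\d+)')                 # compile() -> Pattern object

print(pat.match(text).group())                     # 'chapter 1'  (match at the head)
print(pat.search(text).group(1))                   # '1'          (first match anywhere)
print(pat.findall(text))                           # ['1', '2', '10']
print([m.group() for m in pat.finditer(text)])     # ['chapter 1', 'chapter 2', 'chapter 10']
print(re.split(r',\s*', text))                     # ['chapter 1', 'chapter 2', 'chapter 10']
print(pat.sub('chapter X', text))                  # 'chapter X, chapter X, chapter X'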

Tip /

Regular expressions are widely used in the computer field. It is necessary for everyone to have a good understanding of its syntax.

In the looping_crawl() method, get_novel_name() is used to obtain the book title and turn it into the TXT file name, get_page_urls() to obtain the list of chapter pages, and text_to_txt() to save the captured text content. Throughout, the various element-locating methods described above are used extensively.

03. Run and view the TXT file

Here we select a novel from Zhulang Novel Network, "Peerless Magic Power", run the script, and enter the URL of its chapter list page. When the program runs successfully, the output can be seen in the console, as shown in Figure 3.

■ Figure 3 Output of novel crawler

After the crawl finishes, an extra image named "screenshot.png" (see Figure 4) and a "Peerless Magic Power.txt" file (see Figure 5) appear in the directory; the text of the novel "Peerless Magic Power" has been saved successfully in chapter order.

■ Figure 4 Screenshot of Zhulang Novel Network

■ Figure 5 Part of the novel

The program successfully completes the task of downloading the novel. Its drawbacks are that it takes a long time and that Chrome consumes a lot of hardware resources. For dynamic web pages, browser simulation is not strictly necessary: you can use the browser's developer tools to analyze the page's network requests, and once you find the data interface, request it directly, so Selenium is no longer needed as a middleman. In addition, the screenshot obtained here is a screenshot of the window, not of the entire page (a long image). To capture the whole page or specific page elements, other techniques are required, such as injecting JavaScript scripts, which this article will not cover further.
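As a purely hypothetical sketch of the interface-based approach (the endpoint URL and JSON field names below are invented placeholders, not the site's real API), the request side might look like:

import requests

api_url = 'http://book.zhulang.com/some/chapter/api'  # placeholder, not a real endpoint
resp = requests.get(api_url, timeout=10)
resp.raise_for_status()
data = resp.json()
print(data.get('title'), data.get('content'))  # assumed field names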

 
