Using Python to crawl novels (txt format)

I forget where I originally copied the source code from, but it is really easy to use. If odd strings show up in what you crawl down, adjust the re matching yourself.

This little crawler uses three modules: requests, parsel, and re (though re ends up unused in the final version).

As usual, import the modules first:
import requests
import parsel
import re

Then you would normally disguise the browser identity (the User-Agent), set up the URL, and so on. I am lazy here and just point the URL directly at this novel on the site.
url = "http://www.xpaoshuba.com/Partlist/61563/"
I personally prefer reading science fiction, so I'll use a science fiction novel as the example.
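If you do want to disguise the request, here is a minimal sketch of passing a browser-like User-Agent header (the header string below is just an illustrative value; the site may not actually require it):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)   # send the disguised request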

With the URL defined, the next step is to request it and start parsing the page:
response = requests.get(url)
responses = response.text
#print(responses)  # used while writing the code to inspect the page content
selector = parsel.Selector(responses)
novel_name = selector.css('#info h1::text').get()  # novel name
print(novel_name)
This is where the page gets parsed. The novel's name sits in the h1 tag inside #info, so the CSS selector pulls it out directly. Out of habit I also print the novel name to the console so I can see the progress.
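One optional safeguard that is not in the original code: if the page text comes back garbled, requests may have guessed the encoding wrong. A hedged sketch of re-detecting the encoding and failing fast on HTTP errors:

response = requests.get(url)
response.raise_for_status()                      # stop early on HTTP errors
response.encoding = response.apparent_encoding   # let requests re-detect the page encoding
responses = response.text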
The next step is to find the tags holding the chapter names and the per-chapter links:
href = selector.css('#list dd a::attr(href)').getall()  # chapter links
As you can see, both the chapter names and the links can be pulled straight from the dd tags. I'll be lazy and only grab the links, because each chapter page carries its own title anyway.
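If you did want the chapter names from the index page as well, here is a minimal sketch using the same #list dd structure (purely optional; the rest of the post ignores the names):

names = selector.css('#list dd a::text').getall()         # chapter names
links = selector.css('#list dd a::attr(href)').getall()   # chapter links
for name, link in zip(names, links):
    print(name, link)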
I like to loop over the captured links and save as I go; once the loop finishes, the whole book has been captured:
for link in href:
The chapter link captured above can't be used as-is; it has to be joined onto the site's base URL first. To save a little trouble I splice it by hand:
link_url = 'http://www.xpaoshuba.com' + link
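If you'd rather not splice by hand, urljoin from the standard library does the same job; a minimal sketch:

from urllib.parse import urljoin

link_url = urljoin('http://www.xpaoshuba.com/', link)   # joins the base URL and the relative link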
With the link spliced together, the next step is the same as before: request it, then parse what comes back:
response_1 = requests.get(link_url)
responses_1 = response_1.text
selecter_1 = parsel.Selector(responses_1)
From the parsed page there are mainly two things to extract: the chapter's text and its title:
title_name = selecter_1.css('.zhangjieming h1::text').get()  # chapter title
content_list = selecter_1.css('#content p::text').getall() #Novel content
Name the variables however you like, but pay attention here: the chapter content comes back as a list and has to be converted to a string before it can be written to the text file, otherwise your txt cannot be saved:
ck = str(content_list)
bk = ck.replace("', '", '\n')
The replace() strips out the meaningless separator characters left behind by the str() conversion.
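A cleaner alternative, if you prefer, is to join the list directly instead of going through str() and replace(); a minimal sketch:

bk = '\n'.join(content_list)   # one paragraph per line, no leftover brackets or quotes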

The final step is to write the crawled content into a text file, in chapter order:
try:
    with open('12.txt', 'a') as f:
        f.write(title_name)
        f.write('\n')
        print(title_name)
        f.write(bk)
except:
    print(link_url + " Something went wrong!")
    pass

The try...except is used here so that an error while writing a single page doesn't crash the whole crawler; if one chapter fails, just go read that chapter on the pirate site on your phone, hahahaha. Also, a newline character has to be added after each chapter title, otherwise the file is unreadable; that is all the '\n' does, nothing else.
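For what it's worth, one common cause of such write errors is the platform's default file encoding; here is a hedged sketch of opening the file explicitly as UTF-8 and catching only the errors you expect (the exception choices are my assumption, not from the original):

try:
    with open('12.txt', 'a', encoding='utf-8') as f:   # explicit UTF-8 avoids default-codec errors
        f.write(title_name)
        f.write('\n')
        f.write(bk)
except (TypeError, OSError):   # e.g. a selector that returned None, or a file problem
    print(link_url + ' Something went wrong!')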

The complete code is attached below:

import requests
import parsel
import re   # not actually used in this script

url = "http://www.xpaoshuba.com/Partlist/61563/"
response = requests.get(url)             # fetch the chapter index page
responses = response.text
#print(responses)
selector = parsel.Selector(responses)
novel_name = selector.css('#info h1::text').get()   # novel name
print(novel_name)
href = selector.css('#list dd a::attr(href)').getall()   # chapter links
#print(href)
for link in href:
    link_url = 'http://www.xpaoshuba.com' + link          # splice the full chapter URL
    response_1 = requests.get(link_url)                   # fetch one chapter page
    responses_1 = response_1.text
    selecter_1 = parsel.Selector(responses_1)
    title_name = selecter_1.css('.zhangjieming h1::text').get()   # chapter title
    content_list = selecter_1.css('#content p::text').getall()    # chapter content
    ck = str(content_list)
    bk = ck.replace("', '", '\n')        # turn the list separators into newlines
    #print(bk)
    try:
        with open('12.txt', 'a') as f:   # append this chapter to the output file
            f.write(title_name)
            f.write('\n')
            print(title_name)
            f.write(bk)
            #print(ck)
    except:
        print(link_url + " Something went wrong!")
        pass

Crawling novels is straightforward; you could even rent a server and set up a pirate site with this, but that feels pointless and is asking for trouble. This is for entertainment only, please do not use it for profit!!!


Origin: blog.csdn.net/weixin_44853413/article/details/131698470