Crawling a Novel Website with Python: A Hands-On Project

2017.12.9

At noon I had the idea of crawling novels from a novel website on a large scale. The following problems came up:

  1. The download link may be invalid. On most sites the download link is essentially a URL that the browser's built-in handler recognizes before popping up the download dialog. Solution: confirm in advance, during testing, that the link format works; even so, some links still fail. A rough availability pre-check is sketched right after this item.
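As one possible mitigation, each download URL could be probed with a lightweight request before it is handed to urlretrieve. This is only a sketch under the assumption that the site answers HEAD/GET requests normally; the function name is_link_alive and the timeout value are illustrative, not part of the original program.

import requests

def is_link_alive(url, timeout=10):
    """Rough availability check: True if the server answers with a 2xx/3xx status."""
    header = {'User-Agent': 'Mozilla/5.0'}
    try:
        # Some servers reject HEAD, so fall back to a streamed GET without reading the body.
        resp = requests.head(url, headers=header, allow_redirects=True, timeout=timeout)
        if resp.status_code >= 400:
            resp = requests.get(url, headers=header, stream=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False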
  2. The problem of hidden pages. Solution: observation and comparison show that the second-level page links follow a regular pattern, so a link-generation function builds them and stores them in a list:
def seturl(ur, i):
    # generate the second-level page links (page 2 .. i) and collect them in a list
    urllist = []
    for x in range(2, i + 1):
        urllist.append(ur + str(x) + '.html')
    return urllist
  3. A single failed link used to stop the whole program. Solution: catch the exception and keep going:
    try:
        print('正在下载: 小说  '+name+'...')
        urllib.request.urlretrieve(urla,filepath)
    except:
        print(name+'  '+'下载失败')
    else:
        print(name+'  '+'下载成功') 
                                                                                                                                             
  4. According to the original plan, the program was supposed to start from the home page and keep following category links downward, but testing showed that many of those category links were unavailable. Solution: drop the plan of analyzing from the home page and, for now, treat the category (list) page one layer down as the entry point.

These were the main problems encountered while writing the program.

The idea behind the whole program is simply to keep processing page links: category page, then detail page, then download link.

Areas where the program still needs optimization:

  1. The link-failure problem mentioned above
  2. Tracking and statistics for downloads: file size, progress, elapsed time, etc. (a small sketch covering points 2 and 3 follows this list)
  3. Automatically creating a folder per category when downloading
  4. Download speed (unclear whether this is limited by the site itself or whether the program can be optimized further)
  5. Try multithreading?
  6. Could the program be packaged as an .exe, with a graphical interface?
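As a starting point for points 2 and 3, the sketch below wraps urlretrieve with a progress reporthook, creates the category folder automatically, and times the download. The folder layout, function name and base directory are illustrative assumptions, not part of the original program.

import os
import time
import urllib.request

def download_with_progress(url, category, name, base_dir='D:/小说下载'):
    # Point 3: create one folder per category automatically
    folder = os.path.join(base_dir, category)
    os.makedirs(folder, exist_ok=True)
    filepath = os.path.join(folder, name + '.txt')

    # Point 2: urlretrieve calls reporthook after each block,
    # which lets us print downloaded size and progress
    def reporthook(blocks, block_size, total_size):
        downloaded = blocks * block_size
        if total_size > 0:
            percent = min(100, downloaded * 100 / total_size)
            print('\r%s: %.1f%% (%d / %d bytes)' % (name, percent, downloaded, total_size), end='')

    start = time.time()
    try:
        urllib.request.urlretrieve(url, filepath, reporthook)
    except Exception:
        print('\n' + name + ' 下载失败')   # download failed
    else:
        print('\n%s 下载成功,耗时 %.1f 秒' % (name, time.time() - start))   # success + elapsed time

Point 5 could be explored with concurrent.futures.ThreadPoolExecutor, submitting one download task per link, and point 6 would be a job for a packager such as PyInstaller.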

Reference links:

Python notes

Regular expressions

Parsing HTML with BeautifulSoup (for crawlers)

http://blog.csdn.net/u013372487/article/details/51734047

Regular expression online test

https://c.runoob.com/front-end/854

The Python re module (regular expressions)

http://www.cnblogs.com/tina-python/p/5508402.html

Exception handling

http://www.runoob.com/python/python-exceptions.html

http://www.cnblogs.com/zhangyingai/p/7097920.html

Downloading files with Python

http://www.360doc.com/content/13/0929/20/11729272_318036381.shtml

Measuring running time in Python

http://www.cnblogs.com/rookie-c/p/5827694.html

The program source code follows. Environment configuration:

python3

# packages used
import urllib
import requests
from urllib import request
from bs4 import BeautifulSoup
import re
import time

seturl.py

# Example of a second-level category page: http://www.txt53.com/html/qita/list_45_3.html
def seturl(ur, i):
    # generate the category page URLs from page 2 up to page i
    urllist = []
    for x in range(2, i + 1):
        urllist.append(ur + str(x) + '.html')
    return urllist

end.py

import urllib
from urllib import request
import download
from bs4 import BeautifulSoup
import requests
import re

def end(url):
    # url = 'http://www.txt53.com/down/42598.html'
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
    html = requests.get(url, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    # the download box holds the direct .txt link
    getContent = soup.find('div', attrs={"class": "xiazai"})
    pattern = re.compile('http://www.txt53.com/home/down/txt/id/[0-9]*')
    link = pattern.findall(str(getContent))
    # the novel title sits in the <a class="shang"> element
    getContent = soup.find('a', attrs={"class": "shang"})
    res = r'<a .*?>(.*?)</a>'
    name = re.findall(res, str(getContent), re.S | re.M)
    strurl = link[0]
    download.download(strurl, name)
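Running regexes over str(getContent) works, but it is brittle if the markup changes. A slightly more robust variant, assuming the download box and title anchor keep their current classes and the download URL appears as an href inside that div, would read the attributes directly from the tags. The helper name extract_link_and_name is only illustrative.

# sketch only: same page structure assumed as in end.py above
def extract_link_and_name(soup):
    down_a = soup.find('div', attrs={"class": "xiazai"}).find('a', href=True)
    title_a = soup.find('a', attrs={"class": "shang"})
    return down_a['href'], title_a.get_text(strip=True)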

download.py

import requests
import urllib
from urllib import request

def download(urla, na):
    name = str(na[0])
    # target path: D:/小说下载/<novel name>.txt (the folder must already exist)
    filepath = 'D:/小说下载/' + name + '.txt'
    try:
        print('正在下载: 小说  ' + name + '...')   # downloading novel ...
        urllib.request.urlretrieve(urla, filepath)
    except Exception:
        print(name + '  ' + '下载失败')            # download failed
    else:
        print(name + '  ' + '下载成功')            # download succeeded

main.py

import urllib
import requests
from urllib import request
from bs4 import BeautifulSoup
import re
import end
import seturl
# import timelist
import time
import download
# 1. Nested levels: open the listing page, collect links, and work down layer by layer to the target link
# timelist.viewBar(200)
startime = time.clock()   # note: time.clock() was removed in Python 3.8; time.perf_counter() is the modern equivalent
ur = 'http://www.txt53.com/html/qita/list_45_'
i = 91
numa = 0
urllist = seturl.seturl(ur, i)
for url in urllist[10:]:
    print('正在自动打开新的页面 ' + url)   # opening the next category page
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
    # fetch the category page
    html = requests.get(url, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    getContent = soup.findAll('li', attrs={"class": "qq_g"})
    pattern = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")
    link = pattern.findall(str(getContent))
    # links taken from the ranking list
    for item in link:
        # open the detail page for a single novel
        urlb = item
        html = requests.get(urlb, headers=header)
        soup = BeautifulSoup(html.text, 'html.parser')
        getContent = soup.findAll('div', attrs={"class": "downbox"})
        pattern = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")
        downlink = pattern.findall(str(getContent))
        print('正在打开链接:' + downlink[0])   # opening the download page
        end.end(downlink[0])
        numa = numa + 1
print('共下载小说' + str(numa) + '部,下载完成!')   # total number of novels downloaded
endtime = time.clock()
usetime = endtime - startime
print('本次爬取程序共耗时%f秒' % usetime)   # total crawl time in seconds

 

After the program has been running for a while, the remote host detects the frequent requests and forcibly closes the connection.

The links below describe how to handle this; a minimal delay-and-retry sketch follows them.

[Python crawler error] ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

http://blog.csdn.net/illegalname/article/details/77164521

Six common ways for Python crawlers to get around bans

http://blog.csdn.net/offbye/article/details/52235139
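One common mitigation from those references is simply to slow down and retry: sleep between requests and retry a few times when the connection is reset. This is a minimal sketch assuming the existing requests-based fetching; the helper name polite_get and the delay and retry values are arbitrary illustrative choices.

import time
import requests

def polite_get(url, headers, retries=3, delay=2):
    # sleep between attempts and retry when the remote host resets the connection
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=15)
            time.sleep(delay)   # pause so requests are not fired too quickly
            return resp
        except requests.exceptions.ConnectionError:
            print('连接被重置,%d 秒后重试...' % (delay * (attempt + 1)))   # connection reset, retrying
            time.sleep(delay * (attempt + 1))
    return None

Rotating the User-Agent header or adding proxies, as suggested in the second reference, can be layered on top of the same helper.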


Original article: blog.csdn.net/qq_36766417/article/details/106060770