A simple crawler example with the requests + bs4 + xlwt modules: scraping the Biquge novel ranking list
For crawling the novel content itself, see the companion article: Portal
section1: statement
1. The content crawled in this article is content that any user can view for free.
2. This article is my own study notes and will not be used commercially.
3. If anything in this article infringes on your rights, please contact me and I will delete it!
section2: content crawling analysis
Since we need to crawl the ranking list and filter the data, we certainly need to crawl multiple pages. So let's first look at how the URL changes from page to page; there is usually a pattern to follow.
Comparing a few pages quickly reveals the pattern, so the URLs can be generated in a loop:
for i in range(1, 5):  # adjust the range to the number of pages you want to crawl
    url = 'https://www.52bqg.net/top/allvisit/{}.html'.format(i)
Next, inspect a single page of the list.
The structure is also very obvious: each li node corresponds to one novel, so the fields can be extracted easily with bs4.
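The per-li extraction can be tried out on a minimal HTML fragment first. The class names (`.novelslistss`, `.s1` through `.s5`) match what the site used at the time of writing, but the fragment below is a made-up stand-in for the real page, not actual site content:

```python
import bs4

# A made-up fragment mimicking the assumed structure of the ranking page:
# one <li> per novel, with span classes s1-s5 holding the five fields.
html = '''
<div class="novelslistss">
  <li><span class="s1">玄幻</span><span class="s2"><a>某小说</a></span>
      <span class="s3"><a>第100章</a></span><span class="s4">某作者</span>
      <span class="s5">2020-03-01</span></li>
</div>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
rows = []
for novel in soup.select('.novelslistss li'):
    # .string returns the text of an element that has a single text child
    rows.append([novel.select('.s1')[0].string,
                 novel.select('.s2 a')[0].string,
                 novel.select('.s3 a')[0].string,
                 novel.select('.s4')[0].string,
                 novel.select('.s5')[0].string])
print(rows)
```

Each row ends up as a five-element list in the same order as the spreadsheet columns.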
After extracting the fields, the data needs to be saved. I use the xlwt module to write it to an Excel file; here is the code:
list_all = list()
path = 'D:/笔趣阁目录.xls'
workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
worksheet = workbook.add_sheet('小说目录', cell_overwrite_ok=True)  # sheet name; cells may be overwritten
col = ('小说类型', '小说名', '最新章节', '作者', '最新更新时间')
for i in range(0, 5):
    worksheet.write(0, i, col[i])  # write the header row
for i in range(1, 5):  # adjust the range to the number of pages you want to crawl
    url = 'https://www.52bqg.net/top/allvisit/{}.html'.format(i)
    data_list = get_content(url)
    list_all.append([data_list])
for i in range(len(list_all)):  # one entry per page
    sleep(0.5)  # pause 0.5 s
    print('Downloading the list on page {} ... please wait'.format(i + 1))
    data_s = list_all[i]
    for j in range(len(data_s)):  # unwrap the extra list from append([data_list])
        data = data_s[j]
        for k in range(len(data)):  # 50 novels per page
            data_simple = data[k]
            for m in range(0, 5):  # 5 fields per novel
                worksheet.write(1 + i * 50 + k, m, data_simple[m])
workbook.save(path)
(It was through this example that I first learned xlwt; the articles I referenced are listed at the end.)
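The only subtle part of the write loop above is the row index 1 + i * 50 + k: row 0 holds the header, and each page contributes 50 consecutive rows after it. The arithmetic can be checked in isolation, without xlwt:

```python
# Recompute the target rows for 2 pages of 50 novels each and check
# that they are consecutive and start right after the header row.
rows = []
for i in range(2):           # page index
    for k in range(50):      # novel index within the page
        rows.append(1 + i * 50 + k)

print(rows[0], rows[-1])     # → 1 100
```

So page 0 fills rows 1–50 and page 1 fills rows 51–100, with no gaps or overwrites.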
section3: complete code
import requests
import bs4
import xlwt
from time import sleep
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
def get_content(url):
    res = requests.get(url=url, headers=headers)
    html = res.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    items = soup.select('.novelslistss li')  # one li node per novel
    list_all = []
    for novel in items[0:50]:
        novel_type = novel.select('.s1')[0].string
        novel_name = novel.select('.s2 a')[0].string
        latest_chapters = novel.select('.s3 a')[0].string
        author = novel.select('.s4')[0].string
        update_time = novel.select('.s5')[0].string
        list_all.append([novel_type, novel_name, latest_chapters, author, update_time])
    return list_all
def main():
    list_all = list()
    path = 'D:/笔趣阁目录.xls'
    workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
    worksheet = workbook.add_sheet('小说目录', cell_overwrite_ok=True)  # sheet name; cells may be overwritten
    col = ('小说类型', '小说名', '最新章节', '作者', '最新更新时间')
    for i in range(0, 5):
        worksheet.write(0, i, col[i])  # write the header row
    for i in range(1, 5):  # adjust the range to the number of pages you want to crawl
        url = 'https://www.52bqg.net/top/allvisit/{}.html'.format(i)
        data_list = get_content(url)
        list_all.append([data_list])
    for i in range(len(list_all)):  # one entry per page
        sleep(0.5)  # pause 0.5 s
        print('Downloading the list on page {} ... please wait'.format(i + 1))
        data_s = list_all[i]
        for j in range(len(data_s)):  # unwrap the extra list from append([data_list])
            data = data_s[j]
            for k in range(len(data)):  # 50 novels per page
                data_simple = data[k]
                for m in range(0, 5):  # 5 fields per novel
                    worksheet.write(1 + i * 50 + k, m, data_simple[m])
    workbook.save(path)
    print('All retrieved pages of the list have been saved successfully!')

if __name__ == '__main__':
    main()
section4: running results
In Excel, you can then use the filter function to filter for your favorite novels with multiple conditions!
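If you would rather filter in Python than in Excel, the same multi-condition filtering can be sketched over the extracted rows. The sample rows and the two conditions below are made up for illustration:

```python
# Each row is [type, name, latest chapter, author, update time],
# matching the column order written to the spreadsheet.
rows = [
    ['玄幻', '小说A', '第10章', '作者甲', '2020-03-01'],
    ['都市', '小说B', '第99章', '作者乙', '2020-02-28'],
    ['玄幻', '小说C', '第5章',  '作者乙', '2020-03-02'],
]

# Keep 玄幻 novels updated in March 2020 (two conditions combined).
picked = [r for r in rows
          if r[0] == '玄幻' and r[4].startswith('2020-03')]
print([r[1] for r in picked])  # → ['小说A', '小说C']
```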
section5: Reference articles and learning links
1. Writing to Excel with the Python xlwt module
2. Crawling the ranking of the best Chinese universities (an example)