Scraping blog content with a Python 3 crawler - 代码天地

Other 2018-07-03 10:43:51 Views: 0

#!/usr/bin/python3
# -*- coding: UTF-8 -*-

import os

import requests
from bs4 import BeautifulSoup


class Downloader(object):
    """Collects article links from a CSDN blog and extracts each article's text."""

    def __init__(self):
        self.server = 'https://blog.csdn.net/zhangyun75'
        self.urls = []

    def get_download_url(self):
        """Collect the unique article links found in the blog's article list."""
        req = requests.get(url=self.server)
        html = req.text
        div_bf = BeautifulSoup(html, "lxml")
        div = div_bf.find_all('div', class_='article-list')
        a_bf = BeautifulSoup(str(div[0]), "lxml")
        a = a_bf.find_all('a')
        for each in a:
            link = each.get('href')
            # skip anchors without an href and links that were already collected
            if link and link not in self.urls:
                self.urls.append(link)

    def get_contents(self, target):
        """Return (title, body text) for the article at the given URL."""
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, "lxml")
        title = bf.find_all('h1', class_='title-article')
        title = title[0].text
        texts = bf.find_all('div', class_='htmledit_views')
        texts = texts[0].text.replace('\n', '')
        # print(target, texts)
        return title, texts


dl = Downloader()
dl.get_download_url()
print('start downloading:')
# print(dl.urls)
# make sure the output directory exists before writing into it
os.makedirs('./downfile', exist_ok=True)
i = 0
for url in dl.urls:
    title, text = dl.get_contents(url)
    # print(title, text, url)
    with open("./downfile/file" + str(i) + ".txt", 'w', encoding='utf-8') as f:
        f.write(text + '\n')
    i = i + 1
    print("Downloaded: %.3f%%" % (100 * i / len(dl.urls)))
print('finish downloading:')
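
The script above fetches each article's title but saves the text under numbered names (file0.txt, file1.txt, ...). As a minimal sketch, assuming one wanted the filenames to carry the title instead, a helper along these lines could be used. `safe_filename` and its `max_len` parameter are hypothetical names introduced here for illustration; they are not part of the original script.

```python
import re


def safe_filename(title, index, max_len=80):
    """Hypothetical helper: build a filesystem-safe .txt name from a title."""
    # replace characters that common filesystems reject, plus control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', '_', title).strip()
    # collapse runs of whitespace into a single space
    cleaned = re.sub(r'\s+', ' ', cleaned)
    # truncate, then drop trailing spaces and dots (problematic on Windows)
    cleaned = cleaned[:max_len].rstrip(' .')
    if not cleaned:
        cleaned = 'file'
    # the numeric prefix keeps names unique when two articles share a title
    return '%03d_%s.txt' % (index, cleaned)
```

In the download loop, the numbered path could then be swapped for something like `os.path.join('./downfile', safe_filename(title, i))`.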

Reposted from blog.csdn.net/zhangyun75/article/details/80818911
