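Collecting trending current-affairs news articles:
- Download only the latest trending current-affairs news published today;
- Save news from different websites in separate folders, and record each article's source, title, publication time, download time, URL, and similar information;
- Initial crawler seeds: Sina (news.sina.com.cn), Sohu (news.sohu.com), Ifeng (news.ifeng.com), NetEase (news.163.com), Baidu (news.baidu.com).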
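The "today only" and "one folder per site, with metadata" requirements above can be sketched as two small helpers. This is only an illustrative sketch, not part of the original script: the function names `is_published_today` and `save_article`, the `YYYY-MM-DD` timestamp format, and the choice of a JSON sidecar file for the metadata are all assumptions.

```python
import datetime
import json
import re
from pathlib import Path


def is_published_today(publish_time, today=None):
    """Return True if a timestamp such as '2019-06-20 10:30' falls on `today`.

    Assumption: the page prints the date as YYYY-MM-DD somewhere in the string.
    """
    today = today or datetime.date.today()
    match = re.search(r'(\d{4})-(\d{1,2})-(\d{1,2})', publish_time)
    if match is None:
        return False  # unknown format: treat the article as not current
    try:
        published = datetime.date(*map(int, match.groups()))
    except ValueError:
        return False  # digits matched but do not form a valid calendar date
    return published == today


def save_article(base_dir, site, title, publish_time, url, body):
    """Write the article text and a metadata record into a per-site folder."""
    site_dir = Path(base_dir) / site
    site_dir.mkdir(parents=True, exist_ok=True)

    # Replace characters that are not allowed in file names on common systems.
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_') or 'untitled'

    record = {
        'source': site,
        'title': title,
        'publish_time': publish_time,
        'download_time': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'url': url,
    }
    text_path = site_dir / (safe_title + '.txt')
    text_path.write_text(body, encoding='utf-8')
    meta_path = site_dir / (safe_title + '.json')
    meta_path.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                         encoding='utf-8')
    return text_path
```

A crawler loop could call `is_published_today(publish_time)` before saving, and then `save_article('news', 'sohu', title, publish_time, url, body)` so that each seed site gets its own folder under `news/`.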
import requests
import bs4
import re
import datetime
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
}
url = 'http://news.sohu.com/'
response = requests.get(url, headers=headers)
print(response.status_code)
response.encoding = 'utf-8'
text = response.text
soup = bs4.BeautifulSoup(text, 'html.parser')
# Find the <div id="block4"> container that holds the news headline links.
p1 = soup.find_all('div', {'id': 'block4'})
i = 0
k = 0
# print(p1)
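for each in p1:
    href = each.select('a')  # all <a> tags in this block; their href values are extracted below
    # detail_url = href.get('href')
    print(href)
    href = str(href)
    # Pull every href="..." value out of the stringified tag list.
    pattern = re.compile(r'href="(.*?)" ')
    l = pattern.findall(href)
    # Prepend the site prefix to each extracted link.
    prefix = 'http://news.sohu.com'
    ls = [prefix + url for url in l]
    print(ls)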
title = [[] for _ in range(50)]
data = [[] for _ in range(50)]
source = [[] for _ in range(50)]
while i < len(ls):
    print(ls[i])
    # Fetch and parse the article page.
    response = requests.get(ls[i], headers=headers)
    response.encoding = 'utf-8'
    text = response.text
    soup = bs4.BeautifulSoup(text, 'html.parser')
    title[i] = soup.find('h1').text
    # Keep only Chinese characters so the title can be used as a file name.
    title[i] = ''.join(filter(lambda x: '\u4e00' <= x <= '\u9fa5', title[i].strip()))
    print(title[i])
    # Publication time of the article.
    data[i] = soup.find('span', class_='time').text
    print(data)
    # Original source of the article.
    source[i] = soup.find('span', {'data-role': 'original-link'}).text.strip()
    # Extract the text of every paragraph in the article body.
    s1 = soup.find_all('article', {'class': 'article'})
    for each in s1:
        hr = each.select('p')
        hr = str(hr)
        findjs = re.compile(r'<p.*?>(.*?)</.*?>')
        js = findjs.findall(hr)
        print(js)
    file3 = open(r'%s.txt'%title[i],