The full process of collecting internet data with a crawler.

Collecting trending current-affairs news articles:

  1. Download only the day's latest, trending current-affairs news (a date-filter sketch follows the code below);
  2. Save news from different sites in separate folders, and record each article's source, title, publish time, download time, URL, and other details;
  3. Crawler seed URLs: Sina (news.sina.com.cn), Sohu (news.sohu.com), Ifeng (news.ifeng.com), NetEase (news.163.com), and Baidu (news.baidu.com); see the seed-list sketch right after this list.
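Requirement 3 implies one crawl pass per seed site. Below is a minimal sketch of that seed list; the dict name, keys, and layout are my own and not part of the original code. Each site's front page uses different markup, so each needs its own link extractor and article parser; the code that follows implements the Sohu case.

SEEDS = {
    'sina':  'http://news.sina.com.cn/',
    'sohu':  'http://news.sohu.com/',
    'ifeng': 'http://news.ifeng.com/',
    '163':   'http://news.163.com/',
    'baidu': 'http://news.baidu.com/',
}
# Each key can double as the per-site folder name (requirement 2);
# a full crawler would loop over SEEDS and dispatch to a per-site parser.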

import requests
import bs4
import os
import datetime

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
}

url = 'http://news.sohu.com/'
response = requests.get(url, headers=headers)
print(response.status_code)
response.encoding = 'utf-8'
text = response.text
soup = bs4.BeautifulSoup(text, 'html.parser')
# The div with id "block4" holds the hot-news headline links on the front page
block = soup.find('div', {'id': 'block4'})
anchors = block.select('a') if block else []  # every <a> tag inside that block

# The front-page hrefs are site-relative, so prepend the site prefix
prefix = 'http://news.sohu.com'
ls = [prefix + a.get('href') for a in anchors if a.get('href')]
print(ls)

titles, dates, sources = [], [], []
for link in ls:
    print(link)
    response = requests.get(link, headers=headers)
    response.encoding = 'utf-8'
    text = response.text
    soup = bs4.BeautifulSoup(text, 'html.parser')

    # Article title: keep only the Chinese characters so the title is
    # safe to use as a file name
    title = ''.join(filter(lambda x: '\u4e00' <= x <= '\u9fa5', soup.find('h1').text.strip()))
    print(title)
    titles.append(title)

    # Publish time and original source, from Sohu's article-page markup
    date = soup.find('span', class_='time').text.strip()
    dates.append(date)
    source = soup.find('span', {'data-role': 'original-link'}).text.strip()
    sources.append(source)

    # Article body: the text of every <p> inside <article class="article">
    article = soup.find('article', {'class': 'article'})
    body = '\n'.join(p.get_text(strip=True) for p in article.select('p'))
    print(body)

    # One folder per site (requirement 2), then one .txt file per article
    # with the metadata header followed by the body text
    os.makedirs('sohu', exist_ok=True)
    with open(os.path.join('sohu', '%s.txt' % title), 'w', encoding='utf-8') as f:
        f.write('Source: %s\n' % source)
        f.write('Title: %s\n' % title)
        f.write('Publish time: %s\n' % date)
        f.write('Download time: %s\n' % datetime.datetime.now())
        f.write('URL: %s\n\n' % link)
        f.write(body)
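The loop above saves every headline it finds, but requirement 1 asks for the current day's news only. Below is a minimal date filter; it assumes the page's time span renders like '2023-12-14 10:30', which is a guess, so adjust the format string per site.

import datetime

def is_today(publish_time_text):
    # Parse the publish time shown on the page; the '%Y-%m-%d %H:%M'
    # format is an assumption and may differ between sites.
    published = datetime.datetime.strptime(publish_time_text.strip(), '%Y-%m-%d %H:%M')
    return published.date() == datetime.date.today()

# Inside the download loop, skip anything not published today:
#     if not is_today(date):
#         continue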

Reposted from blog.csdn.net/m0_72935705/article/details/135013197