Scraping Sohu news with a Python crawler

I want to do text classification on news articles, so as a first step I crawled some stories from Sohu's domestic-news channel to build a dataset. The script below walks the paginated list pages, collects the article links on each page, downloads every article, and saves the extracted body text into numbered .txt files, one folder per list page.

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import os

def geturl(URL):
    # Build the URL of one paginated list page in the domestic-news channel
    baseurl = 'http://news.sohu.com/guoneixinwen_%d.shtml' % URL
    response = requests.get(url=baseurl)
    soup = BeautifulSoup(response.content, 'html.parser')   # parse the page source
    pagelist = soup.select('div.article')
    urllist = []
    for page in pagelist:
        # The article link is the second <a> inside each entry block
        url = page.select('a')[1].attrs['href']
        urllist.append(url)
    return urllist

def get_xinwen(URL):
    urllist = geturl(URL)
    contentlist = []
    for url in urllist:
        response = requests.get(url=url)
        soup = BeautifulSoup(response.content, 'html.parser')   # parse the article page
        try:
            # The article body sits in <div id="contentText">
            page = soup.select('div#contentText')[0]
            contentlist.append(page.text)
        except IndexError:
            # Skip articles that do not use the standard template
            pass
    return contentlist

if __name__ == "__main__":
    for num in range(12081, 12181):
        contents = get_xinwen(num)
        os.makedirs('./%d' % num, exist_ok=True)
        for i, content in enumerate(contents):
            # Write each article into its own numbered txt file
            with open('./%d/%d.txt' % (num, i), 'w', encoding='utf-8') as f:
                f.write(content)
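
In practice Sohu can throttle or reject bare requests, so it helps to send a browser-like User-Agent and pause between fetches. Here is a minimal sketch of a wrapper that both requests.get calls above could be swapped for; the header string, delay, and timeout values are my own placeholder choices, not something from the original script:

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}   # placeholder browser-like UA

def polite_get(url, delay=1.0):
    time.sleep(delay)                      # stay gentle: pause between requests
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()            # fail loudly instead of parsing an error page
    return response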

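Once the crawl finishes, every list page has its own folder of numbered .txt files. For the classification step later, a minimal sketch for reading them all back into memory might look like this; it only assumes the folder layout produced by the script above:

import os

def load_texts(root='.'):
    # Collect every article saved by the crawler under root, assuming
    # one digit-named folder per list page with numbered .txt files inside.
    texts = []
    for folder in sorted(os.listdir(root)):
        path = os.path.join(root, folder)
        if not (os.path.isdir(path) and folder.isdigit()):
            continue   # skip anything the crawler did not create
        for name in sorted(os.listdir(path)):
            if name.endswith('.txt'):
                with open(os.path.join(path, name), encoding='utf-8') as f:
                    texts.append(f.read())
    return texts
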
Reposted from blog.csdn.net/heavenmark/article/details/77280576