The Newspaper library: a web-scraping library even beginners can pick up quickly

Contents

Newspaper

Installation

Hands-on

1. Scraping articles from CSDN

2. Browsing NetEase news content

Summary


Newspaper

Newspaper is a powerful Python library built for extracting information from news sites and articles. It offers a simple, efficient way to fetch news pages, parse their content, and pull out useful fields such as the article title, body text, authors, and publish date.

The Newspaper project has earned wide recognition from developers on GitHub, ranking high by star count, which reflects its popularity in the Python scraping ecosystem. It is particularly well suited to news pages, since news sites usually follow fairly regular HTML structures and content layouts.

Newspaper is very easy to use; even complete beginners with no scraping experience can get started quickly. You only need to supply the URL of the target page, and Newspaper downloads and parses the content automatically. It spares you from dealing with request headers, IP proxies, HTML parsing, and page source structure, which greatly simplifies scraper development.

Newspaper offers rich extraction features. Beyond the article body, it automatically identifies and extracts key metadata such as the authors and publish date. It supports many languages, including English, Chinese, German, and Arabic, so it adapts to news sites from different countries and regions. Another highlight is multi-threaded article downloading: it can process multiple pages at once, which greatly speeds up data collection. It can also discover news links on a site and extract text, images, and other media from the HTML, giving you a fuller picture of each story.
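To make this concrete, here is a minimal sketch of the basic workflow; the URL is a placeholder, and the language hint is optional (Newspaper auto-detects it when omitted):

from newspaper import Article

url = 'https://news.example.com/some-article'  # placeholder URL
article = Article(url, language='zh')  # optional language hint
article.download()   # fetch the raw HTML
article.parse()      # extract structured fields

print(article.title)         # article title
print(article.authors)       # list of detected authors
print(article.publish_date)  # publish date, if one was found
print(article.top_image)     # URL of the lead image
print(article.text)          # cleaned body text

And a sketch of the multi-threaded download pool for whole sites (the site URL is only an example):

import newspaper
from newspaper import news_pool

paper = newspaper.build('https://news.163.com', language='zh', memoize_articles=False)
news_pool.set([paper], threads_per_source=2)  # two download threads per source
news_pool.join()                              # block until every article is downloaded
print(len(paper.articles))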

安装

pip install newspaper3k

Note: the package you install is newspaper3k, but you import it with import newspaper.

Installation log:

C:\Users>pip install newspaper3k
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting newspaper3k
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.1/211.1 kB 1.8 MB/s eta 0:00:00
Collecting beautifulsoup4>=4.4.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b1/fe/e8c672695b37eecc5cbf43e1d0638d88d66ba3a44c4d321c796f4e59167f/beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 147.9/147.9 kB 9.2 MB/s eta 0:00:00
Requirement already satisfied: Pillow>=3.3.0 in d:\program files\python\lib\site-packages (from newspaper3k) (10.2.0)
Requirement already satisfied: PyYAML>=3.11 in d:\program files\python\lib\site-packages (from newspaper3k) (6.0.1)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/06/a9/2da08717a6862c48f1d61ef957a7bba171e7eefa6c0aa0ceb96a140c2a6b/cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting lxml>=3.6.0 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/02/59/e1fbe2514d8ab39977b72e77f98d0fa49772f61e938049baf151b307a4f0/lxml-5.1.0-cp312-cp312-win_amd64.whl (3.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 11.3 MB/s eta 0:00:00
Collecting nltk>=3.2.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a6/0a/0d20d2c0f16be91b9fa32a77b76c60f9baf6eba419e5ef5deca17af9c582/nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 19.2 MB/s eta 0:00:00
Requirement already satisfied: requests>=2.10.0 in d:\program files\python\lib\site-packages (from newspaper3k) (2.31.0)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/7c/d4/8c31aad9cc18f451c49f7f9cfb5799dadffc88177f7917bc90a66459b1d7/feedparser-6.0.11-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.3/81.3 kB 4.4 MB/s eta 0:00:00
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/fc/6d/8eaafb735b39c4ab3bb8fe4324ef8f0f0af27a7df9bb4cd503927bd5475d/tldextract-5.1.2-py3-none-any.whl (97 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.6/97.6 kB 5.8 MB/s eta 0:00:00
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/35/82/1251fefec3bb4b03fd966c7e7f7a41c9fc2bb00d823a34c13f847fd61406/feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a9/cb/2c8332bcdc14d33b0bedd18ae0a4981a069c3513e445120da3c3f23a8aaa/jieba3k-0.35.1.zip (7.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.4/7.4 MB 15.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil>=2.5.3 in d:\program files\python\lib\site-packages (from newspaper3k) (2.8.2)
Collecting tinysegmenter==0.3 (from newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/17/82/86982e4b6d16e4febc79c2a1d68ee3b707e8a020c5d2bc4af8052d0f136a/tinysegmenter-0.3.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Collecting soupsieve>1.2 (from beautifulsoup4>=4.4.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4c/f3/038b302fdfbe3be7da016777069f26ceefe11a681055ea1f7817546508e3/soupsieve-2.5-py3-none-any.whl (36 kB)
Requirement already satisfied: six in d:\program files\python\lib\site-packages (from feedfinder2>=0.0.4->newspaper3k) (1.16.0)
Collecting sgmllib3k (from feedparser>=5.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9e/bd/3704a8c3e0942d711c1299ebf7b9091930adae6675d7c8f476a7ce48653c/sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Collecting click (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/00/2e/d53fa4befbf2cfa713304affc7ca780ce4fc1fd8710527771b58311a3229/click-8.1.7-py3-none-any.whl (97 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.9/97.9 kB 5.5 MB/s eta 0:00:00
Collecting joblib (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/10/40/d551139c85db202f1f384ba8bcf96aca2f329440a844f924c8a0040b6d02/joblib-1.3.2-py3-none-any.whl (302 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.2/302.2 kB 19.5 MB/s eta 0:00:00
Collecting regex>=2021.8.3 (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1d/af/4bd17254cdda1d8092460ee5561f013c4ca9c33ecf1aab81b44280327cab/regex-2023.12.25-cp312-cp312-win_amd64.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.9/268.9 kB 17.2 MB/s eta 0:00:00
Collecting tqdm (from nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/14/e75e52d521442e2fcc9f1df3c5e456aead034203d4797867980de558ab34/tqdm-4.66.2-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 4.3 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in d:\program files\python\lib\site-packages (from requests>=2.10.0->newspaper3k) (2023.11.17)
Collecting requests-file>=1.4 (from tldextract>=2.0.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/10/fd/321e33597e09cb4368d361b0b6c6573ef45d5f693acef41ba33673a55b7c/requests_file-2.0.0-py2.py3-none-any.whl (4.2 kB)
Collecting filelock>=3.0.8 (from tldextract>=2.0.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/81/54/84d42a0bee35edba99dee7b59a8d4970eccdd44b99fe728ed912106fc781/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting colorama (from click->nltk>=3.2.1->newspaper3k)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d1/d6/3965ed04c63042e047cb6a3e6ed1a63a35087b6a609aa3a15ed8ac56c221/colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Building wheels for collected packages: tinysegmenter, feedfinder2, jieba3k, sgmllib3k
  Building wheel for tinysegmenter (setup.py) ... done
  Created wheel for tinysegmenter: filename=tinysegmenter-0.3-py3-none-any.whl size=13568 sha256=d393a7188655925876d6346456dfefb3c3c78d46928be57e0c464f6dbd4a02a2
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\b2\9d\99\03ac91b1b064af680304b0051f838ec5b8b6f1507e5f3dd39e
  Building wheel for feedfinder2 (setup.py) ... done
  Created wheel for feedfinder2: filename=feedfinder2-0.0.4-py3-none-any.whl size=3359 sha256=da89dae9448e7f383243238a35d45989a93841af7e7d0b3da14a1c4c403f8c2a
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\97\c2\39\7f52b924caacbee2039a186a0ac787062f9eaeb165d73e4e94
  Building wheel for jieba3k (setup.py) ... done
  Created wheel for jieba3k: filename=jieba3k-0.35.1-py3-none-any.whl size=7398388 sha256=cb8053d445dc4c9cb499865e105bba8a69f65022ce15b205da8537ce382779a2
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\4a\90\9d\d6bbab88e3ba8442ab9ff803197693859ef98b623f0123c68f
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6060 sha256=6b82531de468b4b36ab4bc73ca2bf3532f0e75233140d1ed66f5ec5862ed2c8b
  Stored in directory: c:\users\boyso\appdata\local\pip\cache\wheels\4f\48\8e\790492080c5e85ff08704e66c6e1cc9bd96ac391bb890426bf
Successfully built tinysegmenter feedfinder2 jieba3k sgmllib3k
Installing collected packages: tinysegmenter, sgmllib3k, jieba3k, soupsieve, regex, lxml, joblib, filelock, feedparser, cssselect, colorama, tqdm, requests-file, click, beautifulsoup4, tldextract, nltk, feedfinder2, newspaper3k
Successfully installed beautifulsoup4-4.12.3 click-8.1.7 colorama-0.4.6 cssselect-1.2.0 feedfinder2-0.0.4 feedparser-6.0.11 filelock-3.13.1 jieba3k-0.35.1 joblib-1.3.2 lxml-5.1.0 newspaper3k-0.2.8 nltk-3.8.1 regex-2023.12.25 requests-file-2.0.0 sgmllib3k-1.0.0 soupsieve-2.5 tinysegmenter-0.3 tldextract-5.1.2 tqdm-4.66.2
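
Once installation finishes, a quick check confirms both the distribution name and the import name (importlib.metadata is in the standard library from Python 3.8 on):

from importlib.metadata import version
import newspaper                # imported as "newspaper"
print(version('newspaper3k'))   # installed as "newspaper3k", e.g. 0.2.8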

Hands-on

1. Scraping articles from CSDN

from bs4 import BeautifulSoup
from newspaper import Article

usr = 'boysoft2002'  # any other user ID works too; note it is the ID, not the display name

# Download the user's blog home page and collect every link on it
url = 'https://blog.csdn.net/' + usr
article = Article(url)
article.download()
soup = BeautifulSoup(article.html, 'html.parser')
article_links = soup.find_all('a')

# Keep only links to this user's article pages, then fetch and parse each one
for link in article_links:
    link = link.get('href')
    if link and link.startswith(url + '/article/details'):
        article = Article(link)
        article.download()
        article.parse()
        print(f"Title:{article.title[:-7]}")  # drop the trailing 7-character "-CSDN博客" suffix
        print(f"Link:{article.url}")
        print(f"Text:{article.text}\n")

print('Completed!')

Output:

Title: Git 分布式版本控制系统基本概念和操作命令
Link: https://blog.csdn.net/boysoft2002/article/details/136970364
Text:  Squeezed text(158 lines). 
Title: python共有26个内置类,你知道几个?
Link: https://blog.csdn.net/boysoft2002/article/details/136953633
Text: 本文要介绍的 Python 内置函数属于核心语言的一部分。我们在介绍数据类型、控制结构的文章中已经逐渐学习了 Python 的基本语法,不过还是需要一个时机去学习 Python 中零落无序的内置函数,这就是本期文章的目的。Python 官方提供了 68 个内置函数,这些内置函数主要提供简单且基础的功能,实用性高。需要提醒的是,我们在使用自定义函数时,应该尽量避免函数名与内置函数的名称一样,否则有可能导致程序异常。下面我们以处理数据为目的,分类别向大家介绍 Python 中常用的内置函数。
Title: Help on built-in functions in module builtins (74)
Link: https://blog.csdn.net/boysoft2002/article/details/136951153
Text: 这是Python解释器给出的print()函数的帮助信息。 print()函数是Python内置函数之一,它可以将指定的参数打印输出到标准输出流(默认为sys.stdout)或指定的文件流中。print()函数的参数可以是任意类型的对象,包括字符串、数字、列表、元组、字典等等。 print()函数的参数包括: - *args:表示可变参数,可以传入任意个参数,多个参数之间用逗号隔开。 - sep:表示输出多个参数时,参数之间的分隔符,默认值是一个空格。 - end:表示输出结束时的字符,默认值是一个换行符。 - file:表示输出的目标文件流,默认值是sys.stdout,即标准输出流。 - flush:表示是否立即刷新输出流,默认值是False,即不立即刷新。 例如,下面的语句将会输出三个字符串,并且每个字符串之间用逗号隔开,最后不换行: ``` print("hello", "world", "Python", sep=", ", end="") ```
Title: python 教你如何创建一个自定义库 colorlib.py
Link: https://blog.csdn.net/boysoft2002/article/details/136861675
Text:  Squeezed text(1311 lines). 
Title: Http 超文本传输协议基本概念学习摘录
Link: https://blog.csdn.net/boysoft2002/article/details/136851562
Text:  Squeezed text(238 lines). 
Title: python自定义日历库,与对应calendar库函数功能基本一致
Link: https://blog.csdn.net/boysoft2002/article/details/136823417
Text:  Squeezed text(497 lines). 
Title: python calendar内置日历库函数方法
Link: https://blog.csdn.net/boysoft2002/article/details/136770420
Text:  Squeezed text(409 lines). 
Title: Python 一步一步教你用pyglet制作汉诺塔游戏(终篇)
Link: https://blog.csdn.net/boysoft2002/article/details/136008639
Text:  Squeezed text(322 lines). 
Title: Python 一步一步教你用pyglet制作汉诺塔游戏(续)
Link: https://blog.csdn.net/boysoft2002/article/details/136634444
Text:  Squeezed text(368 lines). 
Title: Python 一步一步教你用pyglet制作汉诺塔游戏
Link: https://blog.csdn.net/boysoft2002/article/details/136598320
Text:  Squeezed text(191 lines). 
Title: Python 初步了解urllib库:网络请求的利器
Link: https://blog.csdn.net/boysoft2002/article/details/136589553
Text:  Squeezed text(208 lines). 
Title: Python 一步一步教你用pyglet仿制鸿蒙系统里的时钟
Link: https://blog.csdn.net/boysoft2002/article/details/136578359
Text:  ...omitted...
Title: Python 一步一步教你用pyglet制作可播放音乐的扬声器类
Link: https://blog.csdn.net/boysoft2002/article/details/136522563
Text:  ...omitted...
Title: python INI文件操作与configparser内置库
Link: https://blog.csdn.net/boysoft2002/article/details/136546933
Text:  Squeezed text(570 lines). 
Title: Pandas DataFrame 基本操作实例100个
Link: https://blog.csdn.net/boysoft2002/article/details/136437876
Text:  Squeezed text(498 lines). 
Title: Pyglet图形界面版2048游戏——详尽实现教程(上)
Link: https://blog.csdn.net/boysoft2002/article/details/136404961
Text:  Squeezed text(308 lines). 
Title: 常用SQL查询方法与实例
Link: https://blog.csdn.net/boysoft2002/article/details/136380309
Text:  Squeezed text(277 lines). 
Title: python 小游戏《2048》字符版非图形界面
Link: https://blog.csdn.net/boysoft2002/article/details/136329625
Text:  Squeezed text(511 lines). 
Title: python|闲谈2048小游戏和数组的旋转及翻转和转置
Link: https://blog.csdn.net/boysoft2002/article/details/136329641
Text:  Squeezed text(919 lines). 
Completed!

Note: "Squeezed text(xxx lines)." appears because the output is too long, so IDLE collapses it; clicking the yellow-highlighted placeholder expands it. Expanding them all eats too much memory, though, so let's modify the code to write the output to a text file instead, which we can then open and read at leisure.

from bs4 import BeautifulSoup
from newspaper import Article

usr = 'boysoft2002'

url = 'https://blog.csdn.net/' + usr
article = Article(url)
article.download()
soup = BeautifulSoup(article.html, 'html.parser')
article_links = soup.find_all('a')

# Write the results to a text file instead of printing to the console
with open('csdnDoc.txt', 'w', encoding='utf-8') as file:
    index = 0
    for link in article_links:
        link = link.get('href')
        if link and link.startswith(url + '/article/details'):
            article = Article(link)
            article.download()
            article.parse()
            index += 1
            print(f"Title-{index:02}:{article.title[:-7]}", file=file)
            print(f"Link:{article.url}", file=file)
            print(f"Text:{article.text}\n", file=file)
print('Completed!')

As the code shows, fetching page content with the newspaper library is very convenient: the user does not need to worry about request headers, IP proxies, and the like. Compare the equivalent requests code for fetching the same links:

import requests
from bs4 import BeautifulSoup

url = 'https://blog.csdn.net/boysoft2002'
# With requests you must supply a browser-like User-Agent yourself
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/58.0.3029.110 Safari/537.3'}

response = requests.get(url, headers=headers)
response.raise_for_status()  # raise an error on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')
article_links = soup.find_all('a')

#......

2. Browsing NetEase news content

from bs4 import BeautifulSoup
from newspaper import Article

# Download the NetEase world-news index page and collect its links
url = 'https://news.163.com/world/'
article = Article(url)
article.download()
soup = BeautifulSoup(article.html, 'html.parser')
article_links = soup.find_all('a')

# Fetch and parse every link that looks like an article page
for link in article_links:
    link = link.get('href')
    if link and '/article' in link:
        article = Article(link)
        article.download()
        article.parse()
        print(f"Title:{article.title}")
        print(f"Link:{article.url}")
        print(f"Text:{article.text}\n")

print('Completed!')

The news titles and links all come out, but comparatively little body text is captured; readers can try other news sites and compare the results. If many pages fail outright, the error-handling sketch below may help.
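Some article pages fail to download or parse, and by default a single bad link raises an exception that aborts the whole loop. A minimal hardening sketch of the loop above (the ArticleException import and the seen set are additions, not part of the original code; article_links comes from the previous snippet):

from newspaper import Article
from newspaper.article import ArticleException

seen = set()  # skip duplicate hrefs, which index pages often repeat
for link in article_links:
    link = link.get('href')
    if not link or '/article' not in link or link in seen:
        continue
    seen.add(link)
    try:
        article = Article(link)
        article.download()
        article.parse()
    except ArticleException:
        continue  # skip pages that fail to download or parse
    print(f"Title:{article.title}")
    print(f"Link:{article.url}")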


Summary

Overall, Newspaper is a Python library very well suited to beginners and to news-focused scraping. It is simple to use and rich in features, helping users quickly pull the information they need out of news sites. For more complex projects, however, or for sites with strong anti-scraping defenses, you may run into assorted bugs or be refused access by the target site outright; in those cases you will need to combine it with other tools or frameworks to build a more stable and efficient scraping pipeline. That said, newspaper's Config object, sketched below, does give you some control over how requests are made.
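A small sketch of the Config object mentioned above (the user-agent string and timeout are example values, not recommendations):

from newspaper import Article, Config

config = Config()
config.browser_user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/58.0.3029.110 Safari/537.3')  # example UA string
config.request_timeout = 10      # seconds to wait before giving up on a request
config.memoize_articles = False  # do not skip articles seen in earlier runs

article = Article('https://news.163.com/world/', config=config)
article.download()
article.parse()

This only shapes the outgoing requests; a site with serious anti-bot measures will still require other tooling.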


Reposted from blog.csdn.net/boysoft2002/article/details/136975040