Preface:
1. First, grab the URL of the target website.
2. Then grab the content of each news article.
3. Extract the fields to be scraped and print them out.
4. Because there are multiple URLs, a for loop is used.
5. Because some articles can be retrieved and some cannot, an if check is added.
6. Collect the data to be captured, then print it out.
7. The printed data could be written to an .xls file, but I really couldn't figure that out. So my crude workaround is to print line by line, paste it into Excel manually, and use it from there.
1. Find the target website
Target site: http://www.ga.dl.gov.cn/index.php?app=jwzx&act=newslist&id=6&page=1
2. Find the URL addresses
The following code finds the URL addresses of all articles:
import requests
from lxml import etree

# Save the collected URLs to a local file
data = open(r'D://hello3.xls', 'w')
for x in range(1, 5):  # loop over pages 1, 2, 3, 4
    s = str(x)
    url1 = 'http://www.ga.dl.gov.cn/index.php?app=jwzx&act=newslist&id=6&page=' + s
    r = requests.get(url1).content
    topic = etree.HTML(r)
    url = topic.xpath('//div[@class="jwzxListLeft fl"]//li/a/@href')  # XPath for the article URLs
    for i in url:
        # write each address to the local file
        data.write(i + '\n')
data.close()
Idea: first collect the URLs of all the news articles, then scrape the news at each URL.
3. Grab each article's source, content, and headline
import requests
from lxml import etree

f = open('D://111111111111111111111111111.txt', 'rb')  # open the URL file; in text mode, 'r' (read-only) is the default
line = f.readlines()
for i in line:
    data = i.decode()
    message = data.strip()
    a = 'http://www.ga.dl.gov.cn/' + message  # build each article URL inside the for loop
    r = requests.get(a).content  # fetch the page
    topic = etree.HTML(r)
    texts = topic.xpath('//*[@class="artCon"]/p/span/text()')  # XPath for the news body
    title = topic.xpath('//*[@class="artTit"]/h1/text()')  # XPath for the news title
    title1 = ''.join(title)  # join the list into a str
    source = topic.xpath('//*[@class="abs"]/span[1]/text()')
    source1 = ''.join(source)[3:]  # join to str, then slice [3:] to keep only the text after the "source:" prefix
    text = [x for x in texts]  # list comprehension: collect the body paragraphs into a list
    if text != []:
        print(a, title1, source1, text)  # some pages have data and some don't, so print only those with data
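Instead of printing each row and pasting it into Excel by hand, the rows can be written to a CSV file, which Excel opens directly. This is a minimal sketch using only the standard library; the file name and the example rows are placeholders, not data from the site:

```python
import csv

# Placeholder rows in the same shape as (url, title, source, text) above
rows = [
    ("http://example.com/news/1", "Title A", "Source A", "Body A"),
    ("http://example.com/news/2", "Title B", "Source B", "Body B"),
]

# utf-8-sig adds a BOM so Excel detects the encoding; newline="" avoids blank lines on Windows
with open("news.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "source", "text"])  # header row
    writer.writerows(rows)
```

In the scraping loop above, the `print(...)` line would be replaced by `writer.writerow([a, title1, source1, ''.join(text)])`.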
4. Find the content of the article
The following posts may be useful references:
Article view and download: https://blog.csdn.net/weixin_41665637/article/details/103051444
Python read-file and save-file template: https://blog.csdn.net/weixin_41665637/article/details/103030166
Crawler XPath basics: https://blog.csdn.net/weixin_41665637/article/details/90637175
Note: If you have any questions, please leave a message.
If anyone can help with saving the results to a file, please leave a message and tell me how to do it. Thank you~~~