Crawler: getting news content [5]

Preface:
1. First, grab the URLs from the target website.
2. Then grab the content of each news article.
3. Extract the fields to be captured and print them out.
4. Because there are multiple URLs, a for loop is used.
5. Because some articles return data and some don't, an if check is added.
6. Collect the data to be captured, then print it out.
7. The printed data could be written to an .xls file, but I really can't figure that out, so my crude workaround is to print line by line and manually paste the output into Excel (a possible csv-based alternative is sketched after this list).
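
On item 7: Python's standard csv module writes a file that Excel opens directly, which would avoid the manual copy-paste. A minimal sketch, not from the original post; the path D://result.csv and the example row are placeholders, and the columns match the fields printed in step 3 below:

import csv

# newline='' lets the csv module control line endings;
# utf-8-sig adds a BOM so Excel detects the encoding.
with open(r'D://result.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title', 'source', 'text'])  # header row
    # in practice the rows would come from the crawl loop in step 3
    writer.writerow(['http://example.com/news1', 'Example title', 'Example source', 'Example text'])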

1. Find the target website

Target: http://www.ga.dl.gov.cn/index.php?app=jwzx&act=newslist&id=6&page=1

2. Find the URL addresses

Code: the following finds the URL addresses of all the articles.

import requests
from lxml import etree

# save the scraped addresses to a local file
data = open(r'D://hello3.xls', 'w')
for x in range(1, 5):  # loop over pages 1, 2, 3, 4
    s = str(x)
    url1 = 'http://www.ga.dl.gov.cn/index.php?app=jwzx&act=newslist&id=6&page=' + s
    r = requests.get(url1).content
    topic = etree.HTML(r)
    url = topic.xpath('//div[@class="jwzxListLeft fl"]//li/a/@href')  # XPath: article links
    for i in url:
        # write each address to the local file
        data.write(i + '\n')
data.close()
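
A small variation, not from the original post: requests can build the query string from a params dict, and adding a timeout plus raise_for_status() makes failures visible instead of hanging or silently parsing an error page. A sketch of the same page loop:

import requests
from lxml import etree

base = 'http://www.ga.dl.gov.cn/index.php'
for page in range(1, 5):  # pages 1..4, as above
    # requests URL-encodes the query string from this dict
    params = {'app': 'jwzx', 'act': 'newslist', 'id': 6, 'page': page}
    resp = requests.get(base, params=params, timeout=10)
    resp.raise_for_status()  # raise on HTTP errors
    topic = etree.HTML(resp.content)
    for href in topic.xpath('//div[@class="jwzxListLeft fl"]//li/a/@href'):
        print(href)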

Idea: get all the URLs of the news first, then grab the news from each URL.

3. Grab each article's source, news content, and news headline

import requests
from lxml import etree

# open the file of saved addresses read-only and get a file handle
f = open('D://111111111111111111111111111.txt', 'rb')
lines = f.readlines()
f.close()
for i in lines:
    data = i.decode()
    message = data.strip()
    a = 'http://www.ga.dl.gov.cn/' + message  # build each article URL in the loop
    r = requests.get(a).content  # fetch the page
    topic = etree.HTML(r)
    texts = topic.xpath('//*[@class="artCon"]/p/span/text()')  # XPath: news content
    title = topic.xpath('//*[@class="artTit"]/h1/text()')  # XPath: news headline
    title1 = ''.join(title)  # join the list into a str
    source = topic.xpath('//*[@class="abs"]/span[1]/text()')
    source1 = ''.join(source)[3:]  # join into a str; [3:] keeps only what follows the "来源:" prefix
    text = [x for x in texts]  # list comprehension: collect the news paragraphs
    if text != []:
        print(a, title1, source1, text)  # some pages yield data and some don't; only print the ones that do
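
One possible refinement, not from the original post: urllib.parse.urljoin handles relative hrefs more robustly than string concatenation, and the rows can go straight into a CSV instead of being printed. A sketch assuming the saved hrefs are relative paths; the file names urls.txt and news.csv are placeholders:

import csv
import requests
from lxml import etree
from urllib.parse import urljoin

BASE = 'http://www.ga.dl.gov.cn/'

with open('D://urls.txt') as f, \
     open('D://news.csv', 'w', newline='', encoding='utf-8-sig') as out:
    writer = csv.writer(out)
    writer.writerow(['url', 'title', 'source', 'text'])
    for line in f:
        href = line.strip()
        if not href:
            continue  # skip blank lines
        url = urljoin(BASE, href)  # safer than BASE + href
        topic = etree.HTML(requests.get(url, timeout=10).content)
        title = ''.join(topic.xpath('//*[@class="artTit"]/h1/text()'))
        source = ''.join(topic.xpath('//*[@class="abs"]/span[1]/text()'))[3:]
        text = ''.join(topic.xpath('//*[@class="artCon"]/p/span/text()'))
        if text:  # only keep pages where the XPath found content
            writer.writerow([url, title, source, text])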

4. Find the content of the article
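To see how the XPath expressions from step 3 map to the page, here is a minimal self-contained demo. The HTML snippet is an assumption inferred from those expressions, not copied from the real site:

from lxml import etree

# assumed page structure, inferred from the XPath expressions in step 3
html = '''
<div class="artTit"><h1>Example headline</h1></div>
<div class="abs"><span>来源:Example source</span></div>
<div class="artCon"><p><span>First paragraph.</span></p>
<p><span>Second paragraph.</span></p></div>
'''
topic = etree.HTML(html)
print(topic.xpath('//*[@class="artTit"]/h1/text()'))      # ['Example headline']
print(topic.xpath('//*[@class="abs"]/span[1]/text()'))    # ['来源:Example source']
print(topic.xpath('//*[@class="artCon"]/p/span/text()'))  # ['First paragraph.', 'Second paragraph.']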

For reference, the following posts may help:

Article view and download: https://blog.csdn.net/weixin_41665637/article/details/103051444
python read file and save file template: https://blog.csdn.net/weixin_41665637/article/details/103030166
crawler xpath: https://blog.csdn.net/weixin_41665637/article/details/90637175

Note: if you have any questions, please leave a message.
If anyone can help with saving the data to a file (the xls problem above), please leave a message and tell me how. Thank you~~~


Origin: blog.csdn.net/weixin_41665637/article/details/109182128