Simple web scraping with Python

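We will scrape my CSDN page: https://blog.csdn.net/zhaoweiya
If import requests is underlined in red in PyCharm, the package is not installed yet. Place the cursor on requests and press Alt + Enter, and PyCharm will suggest fixes. Choose "Install package requests" and PyCharm installs it for us; after a short wait the library is ready. lxml can be installed the same way.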
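Outside PyCharm, the same check can be done from Python itself. Here is a minimal sketch that reports which packages cannot be found by the current interpreter; the helper name missing_packages is my own and is not part of any library.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two third-party packages this post relies on.
missing = missing_packages(["requests", "lxml"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("requests and lxml are both available")
```

Anything reported as missing can be installed from a terminal with python -m pip install requests lxml.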

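import requests
from lxml import etree

# Send a browser-like User-Agent; some sites reject the default one sent by requests.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
}
response = requests.get("https://blog.csdn.net/zhaoweiya", headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

# Parse the HTML and select the text of every article-title link.
tree = etree.HTML(response.text)
titles = tree.xpath('//*[@id="articleMeList-blog"]/div[2]/div/h4/a/text()')
for title in titles:
    # Remove newlines and spaces; note that this also removes spaces inside a title.
    cleaned = title.replace('\n', '').replace(' ', '')
    if cleaned:  # skip text nodes that contained only whitespace
        print(cleaned)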
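To see what the XPath expression selects without touching the network, here is a small offline sketch. It is not the lxml code from this post: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step, so the text is read from each element's .text attribute instead). The sample markup is made up for illustration and only mimics the nesting the expression expects; it is not CSDN's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup with the same nesting as the XPath expects.
SAMPLE = """
<html><body>
<div id="articleMeList-blog">
  <div>header</div>
  <div>
    <div><h4><a href="/a/1">
        First title
    </a></h4></div>
    <div><h4><a href="/a/2">Second title</a></h4></div>
  </div>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Any element with the given id, then its second <div> child,
# then every <div>/<h4>/<a> chain below that.
links = root.findall(".//*[@id='articleMeList-blog']/div[2]/div/h4/a")

# Strip the surrounding newlines and indentation from each link's text.
titles = [a.text.strip() for a in links]
print(titles)
```

The div[2] step is why the first inner div ("header") is skipped: positions in XPath start at 1, so div[2] means the second div child.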

Part of the output (article titles are printed as they appear on the page, in Chinese):

找出列表list中的重复元素
Python列表去重的多种方法
python+selenium滚动条/内嵌滚动条循环下滑,判断是否滑到最底部
Python特殊函数lambdamapfilter
Python嵌套函数和装饰器
python正序循环使用remove和delect删除报index溢出错误
decimal报错:decimal.InvalidOperation:[<class‘decimal.ConversionSyntax‘>]
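In the output above, a title such as "Python特殊函数lambdamapfilter" has its words run together, presumably because the script removes every space, including the ones inside a title. A minimal sketch of a gentler cleanup follows; it collapses each run of whitespace to a single space and drops empty results. The helper name clean_titles and the sample strings are my own, made up for illustration.

```python
def clean_titles(text_nodes):
    """Collapse whitespace runs to single spaces and drop empty results."""
    cleaned = []
    for node in text_nodes:
        # str.split() with no argument splits on any whitespace run,
        # so joining with " " keeps exactly one space between words.
        title = " ".join(node.split())
        if title:
            cleaned.append(title)
    return cleaned


# Made-up text nodes shaped like what an XPath text() query tends to return.
raw = ["\n        Python lambda map filter\n      ", "\n   ", "Nested functions"]
print(clean_titles(raw))  # ['Python lambda map filter', 'Nested functions']
```

The trade-off is that titles which legitimately contain no spaces are unaffected, while titles with spaces keep them.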

Reference: https://blog.csdn.net/IT_XF/article/details/82184585

Reposted from blog.csdn.net/zhaoweiya/article/details/109584565