Simple web scraping with Python

This example crawls my CSDN blog page: https://blog.csdn.net/zhaoweiya
If PyCharm underlines requests in red on the import line, the library is not installed yet. Place the cursor on requests and press Alt + Enter, and PyCharm will offer a fix. Choose "Install package requests" and PyCharm installs it automatically; just wait a moment for the installation to finish. lxml is installed the same way.

import requests
from lxml import etree
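# Send a browser-like User-Agent header with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
}
response = requests.get("https://blog.csdn.net/zhaoweiya", headers=headers)
# Parse the HTML so it can be queried with XPath.
etree_html = etree.HTML(response.text)
# Select the text of every article title link in the article list.
content = etree_html.xpath('//*[@id="articleMeList-blog"]/div[2]/div/h4/a/text()')
for each in content:
    # Remove newlines and spaces (this also removes spaces inside a title).
    title = each.replace('\n', '').replace(' ', '')
    # Skip text nodes that contained only whitespace.
    if title:
        print(title)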
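
To see how the XPath above walks the page, here is a minimal offline sketch. The markup in SAMPLE is made up and only mimics the nesting the expression expects (an element with id articleMeList-blog, its second div, then div/h4/a); the real page may differ. To avoid any network access or third-party install, it uses the standard library's xml.etree.ElementTree instead of lxml. ElementTree supports only a subset of XPath and has no text() step, so the sketch selects the a elements and reads their .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the nesting the XPath expects.
SAMPLE = """
<html><body>
  <div id="articleMeList-blog">
    <div>header</div>
    <div>
      <div><h4><a href="#1">
        First title
      </a></h4></div>
      <div><h4><a href="#2">
        Second title
      </a></h4></div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same path as in the script, minus the trailing text() step.
links = root.findall(".//*[@id='articleMeList-blog']/div[2]/div/h4/a")

# The raw text of each link still carries the newlines and indentation
# from the markup, which is why the script cleans each string.
raw_titles = [a.text for a in links]
titles = [t.strip() for t in raw_titles if t and t.strip()]
print(titles)
```

The raw strings start and end with newlines and spaces, which is the reason the loop in the script cleans each result before printing it.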

Part of the output:

找出列表list中的重复元素
Python列表去重的多种方法
python+selenium滚动条/内嵌滚动条循环下滑,判断是否滑到最底部
Python特殊函数lambdamapfilter
Python嵌套函数和装饰器
python正序循环使用remove和delect删除报index溢出错误
decimal报错:decimal.InvalidOperation:[class‘decimal.ConversionSyntax‘>]
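
Some titles above, such as the one ending in lambdamapfilter, run their words together, presumably because the script deletes every space rather than only the surrounding ones. A small sketch of an alternative that keeps single spaces between words; clean_title is a hypothetical helper and the sample string is made up for illustration.

```python
def clean_title(raw: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(raw.split())


# A made-up text node, padded with the kind of whitespace found in page markup.
raw = "\n        lambda  map filter\n      "

# What the script does: remove every newline and every space.
stripped_all = raw.replace("\n", "").replace(" ", "")
# The alternative: normalize whitespace but keep word boundaries.
normalized = clean_title(raw)

print(repr(stripped_all))
print(repr(normalized))
```

A string that contains only whitespace becomes empty under both approaches, so the check that skips empty results still works.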

Reference: https://blog.csdn.net/IT_XF/article/details/82184585

Origin blog.csdn.net/zhaoweiya/article/details/109584565