Crawling the articles of CSDN's top ten bloggers and converting them to Markdown

CSDN crawling

python+selenium+parsel+tomd

Created by tansty

Code address:
github
gitee


## One, required knowledge

1. Basic knowledge of the parsel module

(1) CSS selectors
First create a parsel.Selector object:
from parsel import Selector
selector = Selector(html)
Here html is a string of HTML or XML, for example the source of a fetched web page.
Once the Selector object exists, query it with css():
tags = selector.css('.content')
The syntax is the same as in a CSS stylesheet, where .class_attr targets elements by class,
so .content selects every tag whose class is content.
The query result is a special object (a SelectorList), so the data cannot be read from it directly.
Two methods convert the result of css() into a string or a list:
• get(): returns the first match as a string
• getall(): returns all matches as a list of strings

(2) Attribute extraction

href_value = selector.css('a::attr(href)').get()   # extract the value of the href attribute
title = page.css(".title-article::text").get()     # extract the text content

2. Selenium methods for selecting elements

find_element_by_class_name: locate by class name

find_element_by_css_selector: locate with a CSS selector

find_element_by_id: locate by id

find_element_by_link_text: locate a link by its exact text

find_element_by_name: locate by the name attribute

find_element_by_partial_link_text: locate a link whose text contains the given string

find_element_by_tag_name: locate by tag name

find_element_by_xpath: locate with an XPath expression

PS: Changing element to elements locates every matching element and returns a list, for example find_elements_by_class_name.

Each item in the list is a WebElement object.
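
The find_element_by_* helpers listed above belong to Selenium 3. They were deprecated in Selenium 4.0 and removed in 4.3, where find_element(by, value) and find_elements(by, value) take their place. The sketch below maps each old name to its locator strategy string (the values behind selenium.webdriver.common.by.By). The LEGACY_TO_BY table and the find_all helper are hypothetical names introduced here, and driver is assumed to be any object that exposes find_elements(by, value).

```python
# Locator strategy strings; these are the values of By.CLASS_NAME, By.XPATH, etc.
LEGACY_TO_BY = {
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
    "find_element_by_id": "id",
    "find_element_by_link_text": "link text",
    "find_element_by_name": "name",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_xpath": "xpath",
}


def find_all(driver, legacy_name, value):
    """Call driver.find_elements with the strategy behind a Selenium 3 helper name.

    Accepts both the singular and the plural spelling of the old helper.
    """
    key = legacy_name.replace("find_elements_", "find_element_")
    return driver.find_elements(LEGACY_TO_BY[key], value)
```

With a real Selenium 4 driver, find_all(driver, "find_elements_by_xpath", "//h2/a") amounts to driver.find_elements(By.XPATH, "//h2/a").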

3. tomd
tomd converts the HTML of the fetched article into Markdown:
text = tomd.Tomd(content).markdown

## Two, code display

1. Get an article

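import os
import re
import time

import parsel
import requests
import tomd

# Crawl a single article. `head` (the request headers) and `filter_str`
# (which cleans the title) are not shown in this excerpt.
def spider_one_csdn(title_url):    # title_url: link to the target article
    html = requests.get(url=title_url, headers=head).text
    page = parsel.Selector(html)   # build the selector over the page source
    title = page.css(".title-article::text").get()
    title = filter_str(title)
    print(title)
    content = page.css("article").get()
    # Strip <a> elements and <br> tags before conversion
    # (\b keeps the pattern from matching the opening <article> tag)
    content = re.sub(r"<a\b[^>]*>.*?</a>", "", content, flags=re.S)
    content = re.sub(r"<br\s*/?>", "", content)
    # Convert the HTML to Markdown
    text = tomd.Tomd(content).markdown
    # Save under ./passage in the current working directory
    final_road = os.path.join(os.getcwd(), "passage")
    os.makedirs(final_road, exist_ok=True)  # no error if it already exists
    with open(os.path.join(final_road, title + ".md"), mode="w", encoding="utf-8") as f:
        f.write("# " + title + "\n\n")
        f.write(text)
    time.sleep(1)  # pause between requests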
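
spider_one_csdn calls filter_str, which this excerpt does not show. Since the title becomes a file name, a plausible version strips the characters that file systems reject. The following is a guess at such a helper, not the author's code; the replacement character and the fallback name are assumptions.

```python
import re

# Characters not allowed in Windows file names, plus control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def filter_str(title, replacement="_", fallback="untitled"):
    """Return a version of `title` that is safe to use as a file name."""
    if title is None:            # .get() returns None when the selector misses
        return fallback
    # Replace illegal characters, then trim surrounding spaces and dots.
    cleaned = _ILLEGAL.sub(replacement, title).strip(" .")
    return cleaned or fallback


print(filter_str('  C/C++: what is "RAII"?\n'))
```

Handling None here keeps a page without a .title-article element from crashing the whole crawl.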

2. Get all of a blogger's articles

def get_article_link(user):
    # Crawl every article of one blogger
    page = 1
    while True:
        link = "https://blog.csdn.net/{}/article/list/{}".format(user, page)
        print("Now crawling page", page)
        html = requests.get(url=link, headers=head).text
        cel = parsel.Selector(html)
        name_link = cel.css(".article-list h4 a::attr(href)").getall()
        if not name_link:
            break  # no articles on this page: stop
        for name in name_link:
            spider_one_csdn(name)
        page += 1
    time.sleep(1)
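
The loop above stops at the first list page that yields no links. The same paging logic can be sketched as a generator that separates link collection from downloading, which makes it easy to try out offline. The names iter_article_links and fetch_links, and the max_pages safety cap, are assumptions introduced for this sketch.

```python
def iter_article_links(user, fetch_links, max_pages=1000):
    """Yield article links page by page until a list page comes back empty.

    fetch_links(url) must return the list of article URLs found on that page.
    """
    for page in range(1, max_pages + 1):
        url = "https://blog.csdn.net/{}/article/list/{}".format(user, page)
        links = fetch_links(url)
        if not links:
            return  # an empty page means there are no more articles
        yield from links
```

In the real crawler, fetch_links would wrap the requests.get and parsel calls shown in get_article_link.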

3. Get the bloggers' names

from selenium import webdriver

def nb_bozhu():
    # Get the CSDN usernames of the top ten bloggers
    driver=webdriver.Chrome()
    driver.implicitly_wait(10)
    driver.get("https://blog.csdn.net/rank/writing_rank")
    names=driver.find_elements_by_xpath("//div[@class='rank-item-box d-flex align-items-center']//div[@class='name d-flex align-items-center']/h2/a")
    name_list=[]
    for name in names:
        final_name=name.get_attribute("outerHTML")
        final_name=re.sub('<a href="https://blog.csdn.net/',"",final_name)
        final_name=re.sub('">.*</a>','',final_name)
        name_list.append(final_name)
        print(final_name)
    driver.quit()
    time.sleep(1)
    return name_list
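
nb_bozhu recovers the username by deleting the text around it in the element's outerHTML. An alternative sketch, assuming profile links have the form https://blog.csdn.net/<username> as the regular expressions above imply, reads the href attribute instead (name.get_attribute("href") in Selenium) and takes the first path segment. The function name username_from_href is introduced here for illustration.

```python
from urllib.parse import urlparse


def username_from_href(href):
    """Return the first path segment of a blog.csdn.net URL, or None."""
    parsed = urlparse(href)
    if parsed.netloc != "blog.csdn.net":
        return None
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parsed.path.split("/") if s]
    return segments[0] if segments else None


print(username_from_href("https://blog.csdn.net/tansty_zh"))
```

Parsing the URL avoids depending on the exact markup of the anchor tag, which can change when the ranking page is redesigned.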

After the program runs, a passage directory is created in the current working directory and holds all the converted articles.

Origin blog.csdn.net/tansty_zh/article/details/108363992