CSDN crawling
python+selenium+parsel+tomd
created by tansty
## One, required knowledge

### 1. Basic knowledge of the parsel module
(1) CSS selectors

First, create a parsel.Selector object:

```python
from parsel import Selector
selector = Selector(html)
```

Here html can be the source code of a requested web page, or a string in HTML or XML format. Once the Selector object is created, you can start querying:

```python
tags = selector.css('.content')
```

In ordinary CSS, a particular tag is styled by its class with the .class_attr notation, and the same applies here: .content queries all tags whose class is content. The query result is a special selector object, so the required data cannot be read from it directly. To convert the result of a css() query into a string or a list, one of the following functions is required:
• get()
• getall()
(2) Attribute extraction

```python
href_value = selector.css('a::attr(href)').get()  # extract the value of the href attribute
title = page.css(".title-article::text").get()    # extract the text content
```
### 2. Selecting elements with selenium
find_element_by_class_name: locate by class
find_element_by_css_selector: locate by CSS selector
find_element_by_id: locate by id
find_element_by_link_text: locate by the exact text of a link
find_element_by_name: locate by the name attribute
find_element_by_partial_link_text: locate by link text, matching as long as the given text is contained in the link's text
find_element_by_tag_name: locate by tag name
find_element_by_xpath: locate with an XPath expression
PS: Changing element to elements locates all matching elements and returns a list.
For example: find_elements_by_class_name
Each of these methods returns a WebElement object (the elements variants return a list of them).
### 3. tomd

```python
text = tomd.Tomd(content).markdown
```

This converts the fetched HTML article into Markdown form.
## Two, code display

### 1. Crawl a single article
```python
# crawl a single article
def spider_one_csdn(title_url):  # link to the target article
    html = requests.get(url=title_url, headers=head).text
    page = parsel.Selector(html)
    # create the parser
    title = page.css(".title-article::text").get()
    title = filter_str(title)
    print(title)
    content = page.css("article").get()
    content = re.sub(r"<a.*?</a>", "", content)
    content = re.sub("<br>", "", content)
    # filter out <a> and <br> tags
    text = tomd.Tomd(content).markdown
    # convert to Markdown
    path = os.getcwd()  # current working directory
    file_name = "./passage"
    final_road = path + file_name
    try:
        os.mkdir(final_road)
        print("Directory created!")
    except OSError:
        # the directory already exists (or another error occurred)
        pass
    with open(final_road + "/" + title + ".md", mode="w", encoding="utf-8") as f:
        f.write("# " + title + "\n")
        f.write(text)
    time.sleep(1)
```
### 2. Get all of a blogger's articles
```python
def get_article_link(user):
    # crawl every article of a given blogger
    page = 1
    while True:
        link = "https://blog.csdn.net/{}/article/list/{}".format(user, page)
        print("Now crawling page", page)
        html = requests.get(url=link, headers=head).text
        cel = parsel.Selector(html)
        name_link = cel.css(".article-list h4 a::attr(href)").getall()
        if not name_link:
            # an empty list means there are no more articles
            break
        for name in name_link:
            spider_one_csdn(name)
        page += 1
        time.sleep(1)
```
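The stop-on-empty-page pattern above can be exercised without the network by swapping in a stubbed fetcher (fake_page_links below is hypothetical, standing in for requests.get plus the CSS query):

```python
# hypothetical stand-in for requests.get + the ::attr(href) query
def fake_page_links(page):
    data = {1: ["/a1", "/a2"], 2: ["/a3"]}  # pretend pages 1-2 have articles
    return data.get(page, [])               # page 3 is empty -> stop signal

def collect_all_links(fetch):
    links, page = [], 1
    while True:
        name_link = fetch(page)
        if not name_link:
            break  # an empty article list means we ran past the last page
        links.extend(name_link)
        page += 1
    return links

print(collect_all_links(fake_page_links))  # ['/a1', '/a2', '/a3']
```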
### 3. Get the bloggers' names
```python
def nb_bozhu():
    # get the CSDN usernames of the top-ten bloggers
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    driver.get("https://blog.csdn.net/rank/writing_rank")
    names = driver.find_elements_by_xpath(
        "//div[@class='rank-item-box d-flex align-items-center']"
        "//div[@class='name d-flex align-items-center']/h2/a"
    )
    name_list = []
    for name in names:
        final_name = name.get_attribute("outerHTML")
        # strip the surrounding <a> markup, leaving only the username
        final_name = re.sub('<a href="https://blog.csdn.net/', "", final_name)
        final_name = re.sub('">.*</a>', "", final_name)
        name_list.append(final_name)
        print(final_name)
    driver.quit()
    time.sleep(1)
    return name_list
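The two-step re.sub extraction can be verified on a sample anchor (the outerHTML string below is made up to match the shape the code expects from the ranking page):

```python
import re

# made-up outerHTML, shaped like the anchors on the ranking page
final_name = '<a href="https://blog.csdn.net/tansty">tansty blog</a>'

final_name = re.sub('<a href="https://blog.csdn.net/', "", final_name)
# now: 'tansty">tansty blog</a>'
final_name = re.sub('">.*</a>', "", final_name)
print(final_name)  # tansty
```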
After the final run, a passage directory containing all the crawled articles will be created in the directory where the program is located.