Table of contents
foreword
I was on a business trip recently and found cockroaches in the hotel I was staying at. So, bored in the evenings, I wrote some crawler code for fun, and this seemed like the right occasion to write it up. This article mainly crawls the titles of the 100 articles on the CSDN comprehensive hot list, then extracts keywords by word segmentation and counts word frequencies.
I figured this might still be useful to other bloggers, since you can see which kinds of titles make the hot list, so I'm sharing it. Along the way I'll also cover how I solved the various problems I ran into.
environment
IDE: Spyder (if you're not used to its interface, bear with it; it's not critical)
Page crawling: chromedriver (I'll explain why later)
Word segmentation: jieba
Target URL: https://blog.csdn.net/rank/list
crawler code
First, why not fetch the page source directly with requests? Because the full ranking is not in the initial response: the page only renders all 100 ranked articles after you scroll to the bottom.
So my idea is to use chromedriver, and then execute some JS to scroll the page to the bottom.
A note on downloading chromedriver: the driver version must match your Chrome version. My laptop is a Mac; you can click Chrome in the menu bar, then About Google Chrome, to see your browser version.
Sharing the download address of chromedriver: google chrome driver download address
Briefly, the driver works by simulating a real browser opening the URL, just as our own hands would; the details can be discussed another day.
No more nonsense, on to the crawler utility code.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 4 17:15:06 2021
@author: huyi
"""
from selenium import webdriver
import time
# =============================================================================
# Crawl a page that loads more content as you scroll down
# =============================================================================
def pa(url):
    driver = webdriver.Chrome('/usr/local/bin/chromedriver')
    driver.get(url)
    js = '''
    let height = 0
    let interval = setInterval(() => {
        window.scrollTo({
            top: height,
            behavior: "smooth"
        });
        height += 500
    }, 500);
    setTimeout(() => {
        clearInterval(interval)
    }, 20000);
    '''
    driver.execute_script(js)
    time.sleep(20)
    source = driver.page_source
    driver.close()
    return source
code description
1. This is mainly a utility method: it uses the driver to open the browser, then simulates scrolling down through the JS snippet.
2. Adjust the timeout according to your network conditions, so the script doesn't end before the scroll reaches the bottom; the hotel network was slow, so I set it fairly high.
3. It returns the page source for the later XPath parsing.
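Rather than sleeping for a fixed 20 seconds, the wait could end as soon as the page height stops growing. Below is a hypothetical refinement of that idea: the scroll action and the height read are passed in as callables, so the loop logic is independent of Selenium. With a real driver you would pass lambdas wrapping driver.execute_script; the function names and defaults here are my own, not from the original script.

```python
import time


def scroll_until_stable(scroll_by, get_height, step=500, pause=0.5, max_wait=20.0):
    """Scroll down repeatedly until the page height stops growing or max_wait elapses.

    scroll_by(step) performs one scroll; get_height() returns the current page height.
    """
    waited = 0.0
    last_height = get_height()
    while waited < max_wait:
        scroll_by(step)
        time.sleep(pause)
        waited += pause
        height = get_height()
        if height == last_height:  # no new content loaded, assume we reached the bottom
            break
        last_height = height
    return last_height
```

With Selenium this would be wired up as, e.g., `scroll_by=lambda s: driver.execute_script(f"window.scrollBy(0, {s})")` and `get_height=lambda: driver.execute_script("return document.body.scrollHeight")`.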
verify
OK, the page source has been retrieved.
keyword extraction code
Next, prepare the keyword-extraction method. No nonsense, on to the code.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 4 21:53:22 2021
@author: huyi
"""
import jieba.analyse
def get_key_word(sentence):
    result_dic = {}
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, weight in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic
code description
1. Briefly: the method keeps the three words with the highest weight (topK=3), which you can adjust to taste.
2. Identical words are counted, which makes it convenient to tally keyword frequencies across the 100 titles later.
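For what it's worth, the per-title dictionaries returned by a method like this can also be merged with collections.Counter, which makes the later aggregation step very compact. A small sketch (the helper name is my own; any extractor producing word-to-count dicts would fit):

```python
from collections import Counter


def merge_keyword_counts(per_title_dicts):
    """Sum keyword counts across titles, lowercasing words to avoid duplicates."""
    total = Counter()
    for d in per_title_dicts:
        total.update({word.lower(): count for word, count in d.items()})
    return total
```

Counter also gives you sorting for free via most_common(), which replaces the manual sorted(..., reverse=True) step.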
main program code
The main program extracts the titles from the page source, using lxml to locate the elements and pull out the title text, then writes the word-frequency statistics to a result file.
No nonsense, on to the code.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 4 14:01:38 2021
@author: huyi
"""
from lxml import etree
from tools.dynamic_page import pa
from tools.analyse_word import get_key_word
csdn_url = 'https://blog.csdn.net/rank/list'
source = etree.HTML(pa(csdn_url))
titles = source.xpath("//div[@class='hosetitem-title']/a/text()")
key_word_dic = {}
for x in titles:
    if x:
        for k, v in get_key_word(x).items():
            if k.lower() in key_word_dic:
                key_word_dic[k.lower()] += v
            else:
                key_word_dic[k.lower()] = v

word_count_sort = sorted(key_word_dic.items(),
                         key=lambda x: x[1], reverse=True)

with open('result.txt', mode='w', encoding='utf-8') as f:
    for y in word_count_sort:
        f.write('{},{}\n'.format(y[0], y[1]))
code description
1. How to get the XPath? Chrome lets you right-click an element and copy its XPath directly, but it's still worth learning the XPath syntax itself.
2. All English words are lowercased to avoid duplicate entries.
3. The output is sorted by word frequency in descending order, with the most frequent words first.
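To see how the title-extraction XPath works without hitting the live site, here is a sketch on a toy snippet that mimics the hot-list markup (the class name is taken from the script above). The real script uses lxml; this demo uses the standard library's xml.etree.ElementTree, which supports the limited XPath subset this pattern needs.

```python
import xml.etree.ElementTree as ET

# Toy markup mimicking the hot-list structure the script parses.
html = """
<div>
  <div class="hosetitem-title"><a>Java is great</a></div>
  <div class="hosetitem-title"><a>Python tips</a></div>
</div>
"""

root = ET.fromstring(html)
# Same pattern as the lxml call: select <a> children of divs with that class.
titles = [a.text for a in root.findall(".//div[@class='hosetitem-title']/a")]
```

Note that ElementTree requires well-formed XML; for the real, messier page HTML, lxml's etree.HTML parser (as used in the main program) is the right tool.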
Validation results
OK, no surprises: java is yyds (Chinese internet slang for "the eternal greatest").
Summary
As you can see, some stray symbols show up in the final statistics. These can be removed with jieba's stop-word support; it depends on how you want to filter the results.
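As a lighter-weight alternative to maintaining a stop-word file, a simple filter can drop tokens that contain no alphanumeric (or CJK) characters before counting. This is my own suggestion, not part of the original script:

```python
def is_meaningful(token):
    """Keep tokens containing at least one letter, digit, or CJK character.

    str.isalnum() is True for CJK characters too, so Chinese words pass.
    """
    return any(ch.isalnum() for ch in token)


def filter_tokens(tokens):
    """Drop pure-symbol tokens such as '...' or '!!' before frequency counting."""
    return [t for t in tokens if is_meaningful(t)]
```

Such a filter would be applied to the extracted keywords just before they are added to key_word_dic.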
To be clear, the case study in this article is for exploration and learning only, not for malicious scraping.
If this article was useful to you, please don't be stingy with your likes. Thank you.