How to crawl the titles of CSDN's comprehensive hot list and count keyword frequency along the way | A web-crawler case

Table of contents

Foreword

Environment

Crawler code

Keyword extraction code

Main program code

Summary


Foreword

I was on a business trip recently and discovered that the hotel I was staying in had cockroaches ("Xiaoqiang"). So, bored in the evenings, I wrote some crawler code for fun, and it made a good occasion for a write-up. This article mainly crawls the 100 titles on the CSDN comprehensive hot list, then extracts keywords via word segmentation and counts the word frequency.

Thinking it over, this should still be useful to other bloggers: you can see what kinds of titles make it onto the hot list. Along the way, I'll share how I solved the various problems I ran into.

Environment

IDE: Spyder (if you're not used to its interface, bear with it; the IDE isn't critical here)

The page is crawled with chromedriver; I'll explain why below.

Word segmentation: jieba

Page to crawl: https://blog.csdn.net/rank/list
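
As a quick sanity check that the environment is ready, the imports below should all succeed (a minimal sketch; your version numbers will differ):

# Minimal environment check: these imports should all succeed.
import selenium
import jieba
import lxml.etree

print('selenium', selenium.__version__)
print('jieba', jieba.__version__)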

Crawler code

First, why not fetch the page source directly with requests? Mainly because a plain request doesn't return the full list: the page only renders all 100 ranked articles after you scroll to the bottom.
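
You can see this for yourself with a plain requests call (a sketch; 'hosetitem-title' is the title class used later in this article, and CSDN's markup may have changed since):

# Sketch: a plain HTTP GET only returns the first screenful of items,
# because the rest are loaded by JS as you scroll.
import requests

html = requests.get('https://blog.csdn.net/rank/list',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
# Count the hot-list title nodes in the raw HTML -- far fewer than 100
print(html.count('hosetitem-title'))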

So my idea is to use chromedriver and execute a bit of JS to scroll the page to the bottom.

A note on downloading chromedriver: the driver version must match your Chrome version. My laptop is a Mac; click Chrome in the upper-left corner, then About Google Chrome, to see your browser version.

Sharing the chromedriver download link: google chrome driver download address

Briefly, the driver works by simulating a real browser opening the URL, just as our hands would; the finer details can wait for another day.
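
As a tiny illustration of that idea (a sketch; the driver path matches the one used later in this article, so adjust it to your machine):

# The driver steers a real Chrome window, just as a human would.
from selenium import webdriver

driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # Selenium 3-style path argument
driver.get('https://blog.csdn.net/rank/list')             # "type" the URL and hit enter
print(driver.title)                                       # the title of the rendered page
driver.quit()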

Without further ado, here is the crawler utility code.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov  4 17:15:06 2021

@author: huyi
"""

from selenium import webdriver
import time

# =============================================================================
# Crawl a page that loads its content dynamically as you scroll down
# =============================================================================
def pa(url):
    driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # Selenium 3-style: path to the driver binary
    driver.get(url)
    # Scroll down 500px every 500ms; stop after 20s, by which point
    # all 100 hot-list items should have loaded.
    js = '''
        let height = 0;
        let interval = setInterval(() => {
            window.scrollTo({
                top: height,
                behavior: "smooth"
            });
            height += 500;
        }, 500);
        setTimeout(() => {
            clearInterval(interval);
        }, 20000);
    '''
    driver.execute_script(js)
    time.sleep(20)  # wait for the scrolling (and the lazy loading) to finish
    source = driver.page_source
    driver.quit()   # quit() also shuts down the chromedriver process
    return source

Code description

1. The code is a utility method: it uses the driver to open the browser, then simulates scrolling down via the JS snippet.

2. Adjust the timeouts to your network conditions, so the script doesn't end before the page has scrolled all the way to the bottom; my hotel's network was slow, so I set them fairly high.

3. It returns the page source, for the xpath parsing later.

Verify

OK, we've got the page source.
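
If you want to check it yourself, here is a quick sketch (assuming pa is importable as in the main program below; rank.html is just a scratch file):

# Fetch the fully-scrolled page and stash it locally for inspection.
from tools.dynamic_page import pa

source = pa('https://blog.csdn.net/rank/list')
print(len(source))  # should be a sizeable chunk of HTML

with open('rank.html', mode='w', encoding='utf-8') as f:
    f.write(source)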

Keyword extraction code

Next, let's prepare the keyword-extraction method. Without further ado, the code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov  4 21:53:22 2021

@author: huyi
"""


import jieba.analyse


def get_key_word(sentence):
    """Extract the top-3 weighted keywords from a sentence and count them."""
    result_dic = {}
    # With withWeight=True, extract_tags returns (word, weight) pairs
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, weight in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic

Code description

1. Briefly: the method takes the three highest-weighted words per title; adjust topK to taste.

2. Identical words are counted, which makes it easy to tally keyword frequency across all 100 titles.
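
For example (a made-up title; the exact keywords and their order depend on jieba's built-in TF-IDF dictionary):

# Each extracted keyword appears once per title, so counts start at 1.
print(get_key_word('Python 爬虫实战:用 selenium 抓取 CSDN 热榜'))
# Something like: {'selenium': 1, '爬虫': 1, 'CSDN': 1}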

Main program code

The main program extracts the titles from the page source, using lxml to pick out the elements, then writes the word-frequency statistics to a result file.

Without further ado, the code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov  4 14:01:38 2021

@author: huyi
"""
from lxml import etree
from tools.dynamic_page import pa
from tools.analyse_word import get_key_word


csdn_url = 'https://blog.csdn.net/rank/list'
source = etree.HTML(pa(csdn_url))

# Pull every hot-list title text out of the rendered page
titles = source.xpath("//div[@class='hosetitem-title']/a/text()")
key_word_dic = {}
for title in titles:
    if title:
        for k, v in get_key_word(title).items():
            # Lower-case so 'Java' and 'java' are counted as one word
            if k.lower() in key_word_dic:
                key_word_dic[k.lower()] += v
            else:
                key_word_dic[k.lower()] = v

# Sort by frequency, most frequent word first
word_count_sort = sorted(key_word_dic.items(),
                         key=lambda item: item[1], reverse=True)

with open('result.txt', mode='w', encoding='utf-8') as f:
    for word, count in word_count_sort:
        f.write('{},{}\n'.format(word, count))

Code description

1. How do you get the xpath? Chrome lets you right-click an element and copy its xpath directly, but it's still worth learning xpath syntax; see the sketch after this list.

2. English words are lower-cased to avoid duplicate counts.

3. The output is sorted by descending word frequency, most frequent word first.
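
Here is a tiny self-contained illustration of that xpath (the markup is a simplified stand-in for CSDN's hot-list HTML):

from lxml import etree

# Two fake hot-list items with the structure the xpath above targets.
html = etree.HTML('''
<div class="hosetitem-title"><a>Title one</a></div>
<div class="hosetitem-title"><a>Title two</a></div>
''')
print(html.xpath("//div[@class='hosetitem-title']/a/text()"))
# ['Title one', 'Title two']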

Validation results

OK, not surprisingly, java is yyds (forever the GOAT).

Summary

You may notice some stray symbols in the final statistics. These can be removed with jieba stop words, or by filtering the tokens yourself, as sketched below.
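
A minimal sketch of both approaches (stop_words.txt is a hypothetical file with one token per line; set_stop_words is jieba's standard API for loading it):

import re

import jieba.analyse

# Option 1: have jieba ignore tokens listed in a stop-words file.
jieba.analyse.set_stop_words('stop_words.txt')  # hypothetical path

# Option 2: drop tokens that contain no word characters (pure symbols).
def keep(word):
    return re.search(r'\w', word) is not None

print([w for w in ['java', '!!', '面试'] if keep(w)])  # ['java', '面试']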

To be clear, the case in this article is for learning and exploration only, not for malicious attacks.

If this article was useful to you, please don't be stingy with your likes. Thank you.

Origin: blog.csdn.net/zhiweihongyan1/article/details/121154001