Crawling a web novel with Python and visualizing the analysis

Time flies and waits for no one. I have seen scenery everywhere, yet what I love most are Mr. Lu Yao's words: "People suffer because they pursue the wrong things. If you do not trouble yourself, no one else can ever trouble you." Life is like that: most of our excessive worrying comes from our own thoughts. Sometimes we need to live a little more freely, care a little less and read a few more books; then we naturally become clear-headed and see things more thoroughly.

"Read ten thousand books and you write as if inspired; in books there are houses of gold and faces like jade; with poetry and books inside, one's bearing shines through"... China is a country of poetry and the cradle of book culture, and books broaden the mind. So today we will do a small project: I will crawl the text of my favorite book, Lu Yao's "Life", and run a simple data analysis on it to find a few of its characteristics.

Without further ado, let's get started!

Destination URL

https://www.cz2che.com/0/175/7710.html

This website collects many excellent books, as well as poetry anthologies and Chinese and foreign classics. You can find your own book there and quietly enjoy the charm of the written word...

Parse URL

When we crawl, we first have to build our own framework. Some websites use anti-crawling techniques and encrypt part of their content; in that case we have to decode the data according to the characteristics of the site, collect it, and keep testing and tuning our code until it works. The code in this article is designed to be portable, but some of the text still has to be matched with regular expressions, XPath, or a BeautifulSoup parser.

The parser I use here is the most common one, XPath, which is simple and easy to understand; for some dynamic websites, though, things get trickier, and we have to choose our tools based on the target URL.

Okay, the preparation is done, so now let's walk through the idea behind the crawler!

Request URL

import requests
from lxml import etree

url = 'https://www.cz2che.com/0/175/7710.html'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
res.encoding = 'GBK'    # the site serves GBK-encoded pages
html = res.text
html_ = etree.HTML(html)    # parse the page source into an element tree
text = html_.xpath('//div[@class="panel-body content-body content-ext"]//text()')

Three steps are indispensable in any crawler.
Step 1: disguise the header. We make our script (run from PyCharm) look like a browser by passing headers as a parameter to the request.
Step 2: set the encoding. We need to know which encoding this URL uses, and this is the step we most often overlook. So how do we find out the encoding of a URL? Type document.charset into the browser console and press Enter, as shown in the figure (a Python-side check is sketched just after it):

[Screenshot: document.charset entered in the browser console]
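If you prefer to stay in Python, requests can make a similar guess; this is just an optional check, assuming the res object from the request above:

print(res.encoding)             # encoding declared in the HTTP headers
print(res.apparent_encoding)    # encoding guessed from the page content, e.g. 'GB2312'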
Step 3: parse the URL. The most commonly used library is lxml. First we take the page source returned as text, then build an element tree with html_=etree.HTML(html), where html is that text; once the source is parsed we can match the data we want.

For reference only; the page structure changes from time to time:

text=html_.xpath('//div[@class="panel-body content-body content-ext"]//text()')

Each call returns a list of text nodes. We store the data paragraph by paragraph, and with that the basic crawl is done. (A small cleanup sketch follows.)
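The list returned by the XPath call also contains indentation and blank text nodes. A minimal cleanup sketch, assuming text is that list:

# drop whitespace-only nodes and trim the rest
text = [t.strip() for t in text if t.strip()]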

Write file

# write each paragraph on its own line; '人生.txt' is the assumed output file
# (it is the same file that is read back in the analysis step)
with open('人生.txt', 'w', encoding='utf-8') as file:
    for s in range(len(text)):
        file.write(text[s] + '\n')

Here the text is written into our txt file paragraph by paragraph. Of course, we have only crawled the first chapter so far; to crawl the whole book we need to look at the pages a little more. Comparing the URLs of adjacent chapters, we can see that only one number in the URL changes:

https://www.cz2che.com/0/175/7713.html
https://www.cz2che.com/0/175/7714.html

So this part is easy: on every loop iteration we just change that number in the request URL, which we can do with format (a loop sketch is given below). A reader once asked me about format because it is so rarely used; I will only say: "Only when you come to use a book do you regret having read so little."

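A rough sketch of the chapter loop; the end of the range (7724) is an assumption and should be set to the book's real last chapter:

import requests
from lxml import etree

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
}

with open('人生.txt', 'w', encoding='utf-8') as file:
    for page in range(7710, 7724):    # 7724 is an assumed upper bound
        url = 'https://www.cz2che.com/0/175/{}.html'.format(page)
        res = requests.get(url=url, headers=headers)
        res.encoding = 'GBK'
        html_ = etree.HTML(res.text)
        text = html_.xpath('//div[@class="panel-body content-body content-ext"]//text()')
        for line in text:
            file.write(line + '\n')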

The same crawl written with BeautifulSoup:

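A rough sketch of how this can look with BeautifulSoup; the selector mirrors the XPath class above, and the remaining details are assumptions:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}    # same idea as before: pretend to be a browser
url = 'https://www.cz2che.com/0/175/7710.html'

res = requests.get(url, headers=headers)
res.encoding = 'GBK'
soup = BeautifulSoup(res.text, 'html.parser')
div = soup.select_one('div.panel-body.content-body.content-ext')
text = div.get_text('\n') if div else ''
print(text[:200])    # preview the first 200 characters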

And with that, the crawling is done!

Analyze data

Once we have the text of the book, the first idea is to segment the Chinese text with the jieba library, count how often each phrase appears in the article, sort the counts, and finally generate a word cloud image to meet the visualization requirement.

So let's get started: read the data, segment the words, count them, visualize them...

    a = file.read()       # the whole book as one string
    b = jieba.lcut(a)     # precise-mode segmentation into a list of words

As for jieba, we use it all the time; the Computer Rank Examination Level 2 also expects a certain understanding of how the library works.
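A tiny demonstration of jieba's precise mode (the sample sentence is my own, and the exact split may vary with the jieba version and dictionary):

import jieba

print(jieba.lcut('人生的道路是漫长的'))
# e.g. ['人生', '的', '道路', '是', '漫长', '的']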

So how do we strip the large amount of punctuation from the article? We use a for loop together with the in operator.

After removing the punctuation, we count the remaining phrases with the standard dictionary-counting pattern.

The code is as follows:

with open(r"人生.txt", encoding="utf-8") as file:
    a=file.read()
    b=jieba.lcut(a)
    for x in b:
        if x in ",。、;:‘’“”【】《》?、.!… ":
            continue
        else:
            if len(x)==1:
                ll.append(x)
            elif len(x)==2:
                lg.append(x)
            elif len(x)==3:
                lk.append(x)
            else:
                lj.append(x)
    with open(r"数据分析.txt", "w", encoding="utf-8") as f:
        for word in lj:#如果想要统计其他字符长度的,只需要换一个变量即可
            d[word] = d.get(word, 0) + 1
        ls = list(d.items())
        ls.sort(key=lambda x: x[1], reverse=True)
        for i in range(10):
            print(ls[i][0])
        for a in ls:
            # new_word=a[0] +' '+str(a[1])
            new_word=a[0]
            f.write(new_word+'\n')
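As a side note, the same tally can be written with collections.Counter; this is just an equivalent sketch, not the original code:

from collections import Counter

counts = Counter(lj)    # lj holds the words of four or more characters
for word, freq in counts.most_common(10):
    print(word, freq)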

The four-character words that appear most often in the book:
[Screenshot: the most frequent four-character words]
Summary

1. Crawl data
2. Store data
3. Analyze data
Finally, we use a word cloud to present the visualization to the reader; a sketch follows:
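A minimal wordcloud sketch; the font path and the choice of the two-character list lg are assumptions on my part:

from wordcloud import WordCloud

wc = WordCloud(font_path='simhei.ttf',    # a Chinese font is required; the path is assumed
               background_color='white', width=800, height=600)
wc.generate(' '.join(lg))                 # lg: the two-character words from above
wc.to_file('wordcloud.png')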

Show results

[Result screenshots: frequency output and word cloud]

The code from this crawl is portable and works for every book on this website. This article only provides the idea; if you need the source code, send me a private message!

One saying per article

Every bit of study and accumulation is the best arrangement. Don't complain that the good things arrive late; as long as you persist, beauty will always meet you when you least expect it!


Origin: blog.csdn.net/weixin_47723732/article/details/109248600