[Algorithm] Word frequency statistics in English essays

Table of Contents

Topic

Sample Essays

Output Example

Algorithm Analysis

Source Code


Topic

    1. Given three English essays, count the number of occurrences of each word in each essay.

    2. Words are separated by spaces, newlines, or punctuation marks; case is ignored.

    3. Print the 5 words with the highest frequency, together with their occurrence counts.

    4. Words are printed in descending order of frequency; words with the same frequency are printed in dictionary (alphabetical) order.

    5. Prepositions, articles, conjunctions, adverbs, and pronouns are not counted.
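Requirements 3 and 4 together define the ranking key: higher frequency first, dictionary order among words with the same frequency. As a minimal sketch of that ordering only (the full program below takes a different route, relying on std::map's key ordering plus stable_sort), a comparator over (word, count) pairs could look like this; the sample data is taken from the test1.txt output:

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Illustrative comparator: by count descending, then by word in dictionary order
bool byFreqThenAlpha(const pair<string, int>& l, const pair<string, int>& r)
{
    if (l.second != r.second)
        return l.second > r.second;     // higher count first
    return l.first < r.first;           // alphabetical tie-break
}

int main()
{
    vector< pair<string, int> > words = {
        {"gives", 2}, {"hope", 4}, {"give", 2}, {"find", 2}, {"faith", 2}
    };
    sort(words.begin(), words.end(), byFreqThenAlpha);
    for (const auto& e : words)
        cout << e.first << " " << e.second << endl;   // hope 4, faith 2, find 2, give 2, gives 2
    return 0;
}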

Sample Essays

test1.txt

In the flood of darkness, hope is the light. It brings comfort, faith, and confidence. 
It gives us guidance when we are lost, and gives support when we are afraid.
And the moment we give up hope, we give up our lives. 
The world we live in is disintegrating into a place of malice and hatred, where we need hope and find it harder. 
In this world of fear, hope to find better, but easier said than done, the more meaningful life of faith will make life meaningful.

test2.txt

No one can help others as much as you do. 
No one can express himself like you. 
No one can express what you want to convey. 
No one can comfort others in your own way. 
No one can be as understanding as you are. 
No one can feel happy, carefree, and no one can smile as much as you do. 
In a word, no one can show your features to anyone else.

test3.txt

Keep faith and hope for the future. 
Make your most sincere dreams, and when the opportunities come, they will fight for them. 
It may take a season or more, but the ending will not change. Ambition, best, become a reality. 
An uncertain future, only one step at a time, the hope can realize the dream of the highest. 
We must treasure the dream, to protect it a season, let it in the heart quietly germinal. 
However, we have to gently protect our hearts deep expectations, slowly dream, will achieve new life.

Output Example

test1.txt: 
hope 4
faith 2
find 2
give 2
gives 2

test2.txt: 
can 8
no 8
one 8
as 6
do 2

test3.txt: 
dream 3
future 2
hope 2
protect 2
season 2

Algorithm Analysis

1. Read each file and store its entire contents as a string in a text variable.

2. Split the text into words using the delimiter set " ,.\n" (space, comma, period, newline); a minimal alternative sketch of steps 1-3 appears after this list.

3. Insert each split word, converted to lower case, into a map while counting its occurrences (a std::map keeps its keys sorted in dictionary order by default).

4. Copy the map entries into a vector and sort them by frequency with stable_sort(); a stable sort preserves the map's dictionary order among words with equal counts.

5. Print the 5 most frequent words and their counts, which now sit at the front of the sorted vector.
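The sketch referenced in step 2: a hedged alternative to the fixed-size char buffer and strtok used in the full program below. It reads the whole file into a std::string and splits it with find_first_of over the same delimiter set. The file name test1.txt is assumed to be in the working directory; stop-word filtering and the top-5 output (steps 4-5) are omitted.

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
using namespace std;

int main()
{
    // Step 1: read the whole file into a string (no fixed-size buffer needed)
    ifstream file("test1.txt");                       // assumed to exist in the working directory
    string text((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());

    // Steps 2-3: split on the same delimiters the full program uses and count in a map
    map<string, int> counts;
    const string delims = " ,.\n";
    string::size_type begin = text.find_first_not_of(delims);
    while (begin != string::npos)
    {
        string::size_type end = text.find_first_of(delims, begin);
        string word = text.substr(begin, end - begin);
        transform(word.begin(), word.end(), word.begin(),
                  [](unsigned char c) { return static_cast<char>(tolower(c)); });
        ++counts[word];
        begin = text.find_first_not_of(delims, end);
    }

    for (const auto& e : counts)
        cout << e.first << " " << e.second << endl;
    return 0;
}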

Source Code

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <cstring>
#include <cstdlib>
#include <algorithm>
#include <vector>
using namespace std;

// Prepositions, articles, conjunctions, adverbs, and pronouns to be excluded from the count
vector<string> g_delWord = {
    "to", "in", "on", "for", "of", "from", "between", "behind", "by", "about", "at", "with", "than",
    "a", "an", "the", "this", "that",
    "and", "but", "or", "so", "yet",
    "often", "very", "then", "therefore",
    "i", "you", "we", "he", "she", "my", "your", "hes", "her", "our", "us", "it",
    "am", "is", "are",
    "when", "where", "who", "what",
    "will", "would"
};

struct compare
{
    bool operator()(const pair<int, string>& l, const pair<int, string>& r)
    {
        return l.first > r.first;
    }
};

int main()
{
    for (int i = 1; i <= 3; ++i)
    {
        // Build the file name: test1.txt, test2.txt, test3.txt
        string fileName = "test";
        fileName += '0' + i;
        fileName += ".txt";

        // Read the file contents
        fstream file;
        file.open(fileName, ios::in);   // open for reading only; ios::out = write only, ios::app = append
        char text[4096] = {0};          // zero-initialized so the buffer stays null-terminated for strtok
        file.read(text, sizeof(text) - 1);
        // cout << fileName << ": " << endl;
        // cout << text << endl << endl;

        // Split the string and store the tokens in a map
        map<string, int> mWords;
        const char* s = " ,.\n";
        char* p = strtok(text, s);
        while (p)
        {
            string word = static_cast<string>(p);
            string lwrWord;
            transform(word.begin(), word.end(), back_inserter(lwrWord), ::tolower);     // convert the word to lower case

            // Skip prepositions, articles, conjunctions, adverbs, and pronouns
            if (find(g_delWord.begin(), g_delWord.end(), lwrWord) == g_delWord.end())
            {
                mWords[lwrWord]++;       // map's operator[] inserts the key with a zero count if absent and returns a reference to the count
            }
            p = strtok(NULL, s);
        }

        // Traverse the map (debug output)
        // int cnt = 0;
        // for (const auto& e: mWords)
        // {
        //     cout << "(" << e.first << ", " << e.second << ")    ";
        //     ++cnt;
        //     if (cnt % 5 == 0)
        //     {
        //         cout << endl;
        //     }
        // }
        // cout << endl <<endl;

        // Copy the map entries into a vector as (count, word) pairs
        vector< pair<int, string> > vWords;     // the space in "> >" keeps pre-C++11 compilers from parsing it as the ">>" operator
        for (const auto& e: mWords)
        {
            vWords.push_back(make_pair(e.second, e.first));
        }

        // Sort by count; sort() is not stable, so either define a comparison that also orders by word or use stable_sort()
        stable_sort(vWords.begin(), vWords.end(), compare());
        cout << fileName << ": " << endl;
        for (int j = 0; j < 5 && j < (int)vWords.size(); ++j)
        {
            cout << vWords[j].second << " " << vWords[j].first << endl;
        }
        cout << endl;
    }

    return 0;
}
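A note on the design: because std::map iterates its keys in dictionary order, the vector is already alphabetically ordered before sorting, and stable_sort on the count alone keeps that order among equal counts, which is exactly the tie-break requirement 4 asks for. Assuming the three test files sit next to the executable, the program can be built with any C++11 compiler, e.g. g++ -std=c++11 wordfreq.cpp -o wordfreq (the file name wordfreq.cpp is only an illustration).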

Origin blog.csdn.net/phoenixFlyzzz/article/details/130475119