PTA 7-44 File similarity based on word frequency (string processing + set container)

Test points for this question:

  • String processing
  • Use of the set container

This problem implements a simple, naive file-similarity measure: the similarity of two files is defined as the ratio of their common vocabulary to their total vocabulary. To simplify matters, Chinese is not considered (word segmentation is too hard); only English words of length at least 3 and at most 10 are counted, and words longer than 10 letters are truncated to their first 10 letters.
Input format:
The input first gives a positive integer N (≤100), the total number of files. The content of each file then follows: the text of the file is given first, ending with a line containing only the single character #. After the contents of all N files, a positive integer M (≤10^4) gives the total number of queries, followed by M lines, each giving a pair of file numbers separated by a space. Files are numbered from 1 to N in the order given.
Output format:
For each query, output the similarity of the two files on one line: the percentage of the two files' total vocabulary that is common to both, accurate to 1 decimal place. Note that a "word" here is a maximal run of English letters with length at least 3 and at most 10; words longer than 10 letters count only by their first 10 letters. Words are separated by any non-letter characters. Capitalization is ignored, so for example "You" and "you" are the same word.
Sample input:
3
Aaa Bbb Ccc
#
Bbb Ccc Ddd
#
Aaa2 ccc Eee
is at Ddd @ Fff
#
2
1 2
1 3
Sample output:
50.0%
33.3%

This question is mainly about string processing. We read the input line by line, split each line into words at non-letter characters, and use a set container to store the distinct words of each file.

I wrote this while stuck at home during this year's epidemic. With various chores to deal with, my efficiency hasn't been high, but I still have many goals to reach and plenty of things to do. My life will be wonderful. Come on!

The complete code is as follows:

#include <iostream>
#include <set>
#include <string>
#include <cctype>
using namespace std;

#define MAXN 105

int N, M;                // number of files, number of queries
string str;              // buffer for each input line
set<string> files[MAXN]; // the distinct words of each file

void handleStr(string str, int No)
{
    string word;
    str += "."; // sentinel so the last word is also flushed
    for (size_t i = 0; i < str.size(); i++)
    {
        if (isalpha((unsigned char)str[i]))
        {
            if (word.size() < 10) // truncate words longer than 10 letters
                word += (char)tolower((unsigned char)str[i]);
        }
        else
        {
            if (word.size() >= 3) // size is already capped at 10 above
                files[No].insert(word);
            word.clear();
        }
    }
}

int main()
{
    scanf("%d", &N); // the leftover newline makes the first getline read an empty line, which handleStr ignores
    for (int i = 1; i <= N; i++)
    {
        do
        {
            getline(cin, str);
            handleStr(str, i);
        } while (str != "#");
    }
    scanf("%d", &M);
    int u, v;
    int same = 0, total = 0;
    for (int i = 0; i < M; i++)
    {
        scanf("%d%d", &u, &v);
        total = (int)files[u].size() + (int)files[v].size(); // |A| + |B|; shared words are subtracted below
        same = 0;
        for (set<string>::iterator it = files[u].begin(); it != files[u].end(); it++)
        {
            if(files[v].find(*it) != files[v].end())
            {
                same++;
                total--;
            }
        }
        printf("%.1f%%\n", total == 0 ? 0 : same * 100.0 / total); // guard against two empty files
    }
    return 0;
}

Origin www.cnblogs.com/veeupup/p/12682158.html