Implementing Word Count in Python

My Word Count project on Gitee: https://gitee.com/qq654488767/system_design_and_analysis

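1. Project Overview

Requirements:

The WordCount requirements can be summarized as follows: count the characters, words, and lines in a programming-language source file, write the results to a default file in a specified format, support several extended features, and process multiple files quickly.

The executable is named wc.exe and accepts user input in the following form: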

wc.exe [parameter] [input_file_name]

By default the results are stored in result.txt, placed in the same directory as wc.exe.
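
As a minimal sketch of the three basic counts described above: the function names match those called later in this post, but the bodies are assumptions, since the original implementations are not shown, and stop-word handling is left out. A word is assumed here to be any run of characters delimited by whitespace or commas.

```python
import re


def count_character(text):
    """Number of characters, including whitespace and newlines."""
    return len(text)


def count_word(text):
    """Number of words, assuming whitespace and commas are the only delimiters."""
    return len([w for w in re.split(r"[\s,]+", text) if w])


def count_line(text):
    """Number of lines; a trailing newline does not start a new line."""
    return len(text.splitlines())


sample = "int a, b;\nreturn 0;\n"
print(count_character(sample), count_word(sample), count_line(sample))  # prints: 20 5 2
```

Under this definition punctuation stays attached to its word ("b;" is one word), so a different delimiter set would give different word counts.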

Implemented features:

usage: WordCount.exe [-h] [-c] [-w] [-l] [-s] [-a] [-e [E]] [-o OUTPUT] [-x]
                     infile

positional arguments:
  infile

optional arguments:
  -h, --help            show this help message and exit
  -c, --character       show the number of characters
  -w, --word            show the number of words
  -l, --line            show the number of lines
  -s, --recursive       process files in the current directory recursively
  -a, --all             count detailed data that includes the amount of code line,
                        blank line, comment line
  -e [E]                count words without stop words in a given filename
  -o OUTPUT, --output OUTPUT
  -x, --interface       show the interface of this program
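
The -a option listed above separates code lines, blank lines, and comment lines. The following is a rough sketch of one possible classification for C-style sources. The function name classify_lines and the rules are assumptions, not the project's actual code: a line counts as a comment only if it starts with // or /* or lies inside a block comment, code with a trailing comment counts as code, and comment markers inside string literals are not handled.

```python
def classify_lines(text):
    """Return (code, blank, comment) line counts for C-style source text."""
    code = blank = comment = 0
    in_block = False  # True while inside an unterminated /* ... */ comment
    for line in text.splitlines():
        stripped = line.strip()
        if in_block:
            comment += 1
            if "*/" in stripped:
                in_block = False
        elif not stripped:
            blank += 1
        elif stripped.startswith("//"):
            comment += 1
        elif stripped.startswith("/*"):
            comment += 1
            if "*/" not in stripped[2:]:
                in_block = True
        else:
            code += 1
            # code followed by a block comment that is still open at end of line
            opener = stripped.rfind("/*")
            if opener != -1 and "*/" not in stripped[opener + 2:]:
                in_block = True
    return code, blank, comment


sample = "int x = 1; // init\n\n/* block\n   comment */\n// done\n"
print(classify_lines(sample))  # prints: (1, 1, 3)
```

Other definitions are possible, for example treating a line that holds only a single brace as blank, so the counts depend on which rules the assignment specifies.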

2. PSP 2.1 Table

PSP 2.1 stage	Estimated time (minutes)	Actual time (minutes)
Planning	20	30
· Estimate (how long the task will take)	420	840
Development	240	600
· Analysis (requirements analysis, including learning new technologies)	60	100
· Design Spec (writing the design document)	30	0
· Design Review (reviewing the design document with colleagues)	50	0
· Coding Standard (setting suitable conventions for this development)	30	0
· Design (detailed design)	60	40
· Coding (actual coding)	420	400
· Code Review	30	60
· Test (self-testing, fixing code, committing changes)	30	40
Reporting	30	0
· Test Report	30	0
· Size Measurement (calculating the workload)	10	10
· Postmortem & Process Improvement Plan (retrospective and proposed process improvements)	60	0
Total	1520	580

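3. Design Approach

3.1 Choice of Language

The task may involve designing a user interface, and I know little about the MFC framework for C++. I have also worked with Java for a long time and wanted to try a different approach, so I chose Python as the development language.

3.2 Overall Design

I divided the program into three parts. The first parses the command-line arguments, the second selects which functions to run based on the parsed result, and the third collects the results and writes them to a file.

For argument parsing, I use the argparse module to register the options and parse the user's input. This lets me spend most of my time on the individual feature functions and shortens the project overall.

The second part calls the function for each requested option based on the parsed result, adds each result to a dict, and passes that dict to the file-writing function.

The third part is straightforward: iterate over the dict and write the results to the file.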

4. Key Code Analysis

4.1 Parsing Arguments

def parse_args(parser):
    """Parse command line input.

    :param parser: the argparse.ArgumentParser to configure
    :return: the parsed arguments
    :rtype: argparse.Namespace
    """

    # add command arguments, -c,-w,-l... 
    parser.add_argument("-c", "--character", action="store_true", help="show the amount of characters")
    parser.add_argument("-w", "--word", action="store_true", help="show the amount of words")
    parser.add_argument("-l", "--line", action="store_true", help="show the amount of lines")
    parser.add_argument("-s", "--recursive", action="store_true", help="process files in current directory recursively")
    parser.add_argument("-a", "--all", action="store_true",
                        help="count detailed data that includes amount of code line, blank line, comment line")
    parser.add_argument("-e", nargs="?", default="stopword.txt",
                        help="count words without stop words in given name")
    parser.add_argument("infile")
    parser.add_argument("-o", "--output")
    parser.add_argument("-x", "--interface", action="store_true", help="show the interface of this program")
    # here does all the processing work
    args = parser.parse_args()

    return args

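Because command-line parsing is handled by argparse from the Python standard library, this part was fairly easy and the code stays short and clear.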
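
To see what the parser returns, parse_args can be given an explicit argument list instead of reading sys.argv. The sketch below registers only a subset of the options above, and build_parser is a hypothetical helper name. It also shows how nargs="?" combined with default behaves for -e.

```python
import argparse


def build_parser():
    """Hypothetical helper: a reduced version of the parser configured above."""
    parser = argparse.ArgumentParser(prog="WordCount")
    parser.add_argument("-w", "--word", action="store_true")
    parser.add_argument("-e", nargs="?", default="stopword.txt")
    parser.add_argument("-o", "--output")
    parser.add_argument("infile")
    return parser


parser = build_parser()

# -e omitted: args.e falls back to the default value
args = parser.parse_args(["-w", "test.cpp"])
print(args.word, args.e, args.output)  # prints: True stopword.txt None

# -e followed by a value
args = parser.parse_args(["-w", "-e", "stop_word.txt", "test.cpp"])
print(args.e)  # prints: stop_word.txt

# bare -e with no value: argparse stores const, which is None unless set
args = parser.parse_args(["test.cpp", "-e"])
print(args.e)  # prints: None
```

With this configuration args.e holds a filename even when -e is absent. If the intent is to apply a stop list only when -e is given, one option is nargs="?" with const="stopword.txt" and default=None, so that None means the flag was not used.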

4.2 Dispatching Functions

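def handle_parameters(args):
    """Run the requested counts according to the parsed arguments.

    :param args: the argparse.Namespace returned by parse_args
    :return: dict mapping each file path to its result dict, or None if
             the input is invalid or the GUI was shown
    """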
    # check if input filename type is file we can handle
    if not is_valid_file_name(args.infile):
        print("error:{} is not a valid file name!".format(args.infile))
        return

    result_dic = {}

    # if -x is inside command line option, exit after finishing
    if args.interface:
        app = wx.App()
        frm = MyFrame1(None)
        frm.Show()
        app.MainLoop()
        return

    # In recursive mode the input filename is a pattern such as *.cpp that cannot
    # be opened directly, so we only use its suffix to select matching files.
    if args.recursive:
        # get each filename and check whether it's the file we should handle.
        for each_file in list(get_file_recursively(os.getcwd())):
            if not is_valid_file_name(each_file):
                continue
            # compare file extensions; os.path.splitext copes with dots elsewhere in the path
            if os.path.splitext(each_file)[1] != os.path.splitext(args.infile)[1]:
                continue
            # set default output filename
            cur_file_result_dic = {OUTPUT_FILENAME: "result.txt"}
            # read file content
            with open(each_file, 'r', encoding="utf-8") as f:
                file_content = f.read()
            # args is an argparse.Namespace; every flag uses store_true, so each
            # attribute is a bool that tells us whether to run the matching function.
            if args.character:
                cur_file_result_dic[CHARACTER_COUNT_RESULT] = count_character(file_content)
            if args.word:
                cur_file_result_dic[WORD_COUNT_RESULT] = count_word(file_content, args.e)
            if args.line:
                cur_file_result_dic[LINE_COUNT_RESULT] = count_line(file_content)
            if args.output:
                cur_file_result_dic[OUTPUT_FILENAME] = args.output
            if args.all:
                cur_file_result_dic[CODE_LINE_COUNT] = count_code_line(file_content)
                cur_file_result_dic[BLANK_LINE_COUNT] = count_blank_line(file_content)
                cur_file_result_dic[COMMENT_LINE_COUNT] = count_comment_line(file_content)
            # record to result_dic of each file
            result_dic[each_file] = cur_file_result_dic
    # if not recursive mode
    else:
        # same process
        cur_file_result_dic = {OUTPUT_FILENAME: "result.txt"}
        file_content = is_valid_file(args.infile).read()
        if args.character:
            cur_file_result_dic[CHARACTER_COUNT_RESULT] = count_character(file_content)
        if args.word:
            cur_file_result_dic[WORD_COUNT_RESULT] = count_word(file_content, args.e)
        if args.line:
            cur_file_result_dic[LINE_COUNT_RESULT] = count_line(file_content)
        if args.output:
            cur_file_result_dic[OUTPUT_FILENAME] = args.output
        if args.all:
            cur_file_result_dic[CODE_LINE_COUNT] = count_code_line(file_content)
            cur_file_result_dic[BLANK_LINE_COUNT] = count_blank_line(file_content)
            cur_file_result_dic[COMMENT_LINE_COUNT] = count_comment_line(file_content)
        # prefix the current directory so the key has the same form as in recursive mode
        # and write_to_file can strip the prefix in the same way.
        result_dic[os.path.join(os.getcwd(), args.infile)] = cur_file_result_dic

    return result_dic

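Here we read the requested options from the args object returned by argparse. The parsed args is an argparse.Namespace whose attributes are named after the options. Because each flag was registered with action="store_true" in add_argument, we can simply test whether an attribute is true and call the corresponding function. Note that the user-interface flag must be checked before any counting is done: if it is set, the remaining options need not be executed, since everything can be done through the GUI.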
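
The recursive and non-recursive branches of handle_parameters repeat the same chain of if statements. One possible way to share it, shown as a sketch rather than the project's actual code, is a table that maps each flag to a result key and a function. COUNTERS, collect_counts, and the key strings are hypothetical names, and the two counting functions are simplified stand-ins.

```python
from types import SimpleNamespace


# Stand-ins for the post's counting functions (bodies are assumptions).
def count_character(text):
    return len(text)


def count_line(text):
    return len(text.splitlines())


# flag attribute on args -> (key in the result dict, function to call)
COUNTERS = {
    "character": ("character_count", count_character),
    "line": ("line_count", count_line),
}


def collect_counts(args, file_content):
    """Run only the counters whose flag is set on args."""
    result = {}
    for flag, (key, func) in COUNTERS.items():
        if getattr(args, flag, False):
            result[key] = func(file_content)
    return result


# SimpleNamespace mimics the attribute access of argparse.Namespace
args = SimpleNamespace(character=True, line=False)
print(collect_counts(args, "a\nb\n"))  # prints: {'character_count': 4}
```

Both branches could then call collect_counts once per file, and adding a new option would only need a new table entry.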

4.3 Writing to File

def write_to_file(result_dic, mode="w"):
    """write or append data to file
    
    :param result_dic: result dict
    :param mode: file process mode
    :return: none
    """
    # The output file path is stored inside each input file's result dict,
    # so we have to look inside the dict to find it.
    # A poorly designed data structure leads to awkward code.
    
    result_file_path = ""
    if result_dic is None:
        return
    for each_key in result_dic.keys():
        result_file_path = result_dic[each_key].get(OUTPUT_FILENAME)
        break
    if result_file_path == "" or result_file_path is None:
        return
    # if output file path is valid
    # start writing
    # (label written to the file, key in the per-file result dict)
    # labels: characters, words, lines, code lines, comment lines, blank lines
    fields = [
        ("字符数", CHARACTER_COUNT_RESULT),
        ("单词数", WORD_COUNT_RESULT),
        ("行数", LINE_COUNT_RESULT),
        ("代码行数", CODE_LINE_COUNT),
        ("注释行数", COMMENT_LINE_COUNT),
        ("空白行数", BLANK_LINE_COUNT),
    ]
    with open(result_file_path, mode, encoding="utf-8") as f:
        for each_key, each_result in result_dic.items():
            # remove the current-directory prefix
            f.write(each_key[len(os.getcwd()) + 1:] + ",")
            for label, key in fields:
                # write a field only if it was actually counted
                if each_result.get(key) is not None:
                    f.write(label + "," + str(each_result[key]) + ",")
            f.write("\n")

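The data structure was not designed well when I first wrote the code, so the output filename has to be extracted from the dict before writing.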
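
One possible way to avoid that lookup, sketched here under assumptions rather than taken from the project, is to keep the output path next to the per-file results instead of inside them. Report and write_report are hypothetical names, and the counts in the example are made-up illustrative values.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical container: one output path plus per-file counts."""
    output_path: str = "result.txt"
    per_file: dict = field(default_factory=dict)  # filename -> {label: count}


def write_report(report, stream):
    """Write one comma-separated line per input file to an open text stream."""
    for filename, counts in report.per_file.items():
        fields = [filename]
        for label, value in counts.items():
            fields += [label, str(value)]
        stream.write(",".join(fields) + "\n")


report = Report(output_path="result.txt")
report.per_file["test.cpp"] = {"characters": 120, "lines": 13}  # made-up values

# StringIO stands in for open(report.output_path, "w", encoding="utf-8")
buffer = io.StringIO()
write_report(report, buffer)
print(buffer.getvalue(), end="")  # prints: test.cpp,characters,120,lines,13
```

With the path stored once on the report, the writer no longer needs to peek into the first entry to find out where to write.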

5. Test Design

5.1 Test Cases

WordCount -o
WordCount test.cpp
WordCount -o test.cpp
WordCount -x test.cpp
WordCount -l -c -w test.cpp
WordCount -l -c -a -o result_test.cpp test.cpp
WordCount -l -c -a -w -s *.cpp
WordCount -l -c -a -w -s -e stop_word.txt *.cpp

5.2 Test Results

WordCount -o

WordCount test.cpp

WordCount -o test.cpp

WordCount -x test.cpp

WordCount -l -c -w test.cpp

WordCount -l -c -a -o result_test.cpp test.cpp

Test data used in PyCharm:

#include<stdio.h>
// hello this is comment



int main(){
    printf("Hello world!");
    return 0;
    /*
        //
    this is test
    */
}

WordCount -l -c -a -w -s -e stop_word.txt *.cpp

WordCount -l -c -a -w -s *.cpp

The stop word is include.
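
As a rough sketch of how a stop list such as the one above could be applied: the function names and the tokenization below are assumptions, not the project's code. Words are split on whitespace and commas, and a word is dropped only when it matches a stop word exactly.

```python
import re


def load_stop_words(text):
    """Parse stop-list contents: words separated by whitespace."""
    return set(text.split())


def count_word_without_stop_words(text, stop_words):
    """Count words that are not in stop_words (exact, case-sensitive match)."""
    words = [w for w in re.split(r"[\s,]+", text) if w]
    return sum(1 for w in words if w not in stop_words)


stop_words = load_stop_words("include\n")
print(count_word_without_stop_words("include stdio, include main", stop_words))  # prints: 2
print(count_word_without_stop_words("#include<stdio.h>\nint main", stop_words))  # prints: 3
```

The second call shows that, under this tokenization, #include<stdio.h> is a single word and is not removed by the stop word include, so the effect of a stop list depends on how words are delimited.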

6. References

Zou Xin, 《构建之法：现代软件工程》 (Build to Win: Modern Software Engineering), 3rd edition.