Classifying and summarizing large datasets with Python: a one-click office automation tool

Real projects grow out of everyday problems!

Project introduction

Yesterday I ran into a headache of a task. I had a file of dormitory data for a school, about 4,000 rows in total, that could be classified along several dimensions. The data had to be split by those dimensions and a separate table produced for each group. The end result is 8 folders, each containing 24 tables, and that is what our program needs to generate. Doing this in Excel means filtering over and over, and it would take several people working together, which is laborious. Python is a powerful tool for data analysis, so can it solve this problem? Of course it can!

Project idea

1. First, import the data with the csv library, parse each row into a Python object, and keep the result in memory for the next steps.
2. After importing, classify the data. This needs an algorithm, which I call the "dictionary iteration algorithm" (a name I made up myself). There are several pitfalls along the way. Finally, wrap this logic in a function.
3. Save the data: write it to CSV files, using Python's built-in os module to create the folders that organize the output. At this stage we also have to solve the problem of garbled Chinese characters in the CSV files.
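
The three steps above can be sketched end to end on a tiny made-up sample. This is only an illustration under stated assumptions: it uses csv.DictReader and itertools.groupby from the standard library rather than the hand-written loop in this article, the helper names split_by_prefix and write_groups are hypothetical, and only the column name 宿舍编号 (dormitory number) comes from the real data.

```python
import csv
import io
import os
import tempfile
from itertools import groupby

KEY = "宿舍编号"  # dormitory-number column, as used in this article's code


def split_by_prefix(rows, prefix_len=3):
    """Group consecutive rows whose dormitory number shares the same prefix."""
    return [
        (prefix, list(group))
        for prefix, group in groupby(rows, key=lambda r: r[KEY][:prefix_len])
    ]


def write_groups(groups, header, out_dir):
    """Write one CSV file per group into out_dir and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for prefix, rows in groups:
        path = os.path.join(out_dir, f"{prefix}.csv")
        with open(path, "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.DictWriter(f, fieldnames=header)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths


# Made-up sample standing in for the real dormitory file (assumption)
SAMPLE = "宿舍编号,name\nA11-01,Zhang\nA11-02,Li\nA12-01,Wang\n"

reader = csv.DictReader(io.StringIO(SAMPLE))   # step 1: parse into dictionaries
rows = list(reader)
groups = split_by_prefix(rows)                 # step 2: split by number prefix
out_dir = tempfile.mkdtemp()
paths = write_groups(groups, reader.fieldnames, out_dir)  # step 3: save
print([os.path.basename(p) for p in paths])    # ['A11.csv', 'A12.csv']
```

Like the loop in this article, groupby only merges neighbouring rows, so the sketch assumes the input is already ordered by dormitory number.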

Difficulties

1. How to split the parsed data and save it
2. How to avoid garbled text when writing the files
3. How to structure the program

Code introduction

That is the general idea. Now let's go through the program step by step.

Parsing the data

# 1. Parse the CSV data and keep it in memory as a list of dictionaries
def csv_data():
    global dormitory_data
    import csv
    dormitory_data = []
    # Put the CSV file in the same folder as this script
    with open(r"寝室数据.csv", encoding='utf-8-sig', newline='') as file:
        f_csv = csv.reader(file)  # yields each row of the file as a list of strings
        header = next(f_csv)  # take the header row, so f_csv now holds only data rows
        for row in f_csv:
            data = {}
            for index in range(7):  # copy the first 7 columns
                data[header[index]] = row[index]
            dormitory_data.append(data)

Here the Excel data is first saved as a CSV file, and then imported and parsed. Note that this means exporting the sheet from Excel as CSV; renaming the file extension alone does not convert the file.

The parsing is similar to my previous article "Writing a Score Calculation System in Python". The main point is how the header row is extracted: calling next() on the reader returns the header and advances the iterator, so the loop that follows sees only data rows. Each row becomes a dictionary, and the dictionaries are collected in a list. Note that the list is declared as a global variable so that the other functions can use it.
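
The header extraction can be tried in isolation. The following is a minimal sketch on a made-up two-column sample (the real file has seven columns); the name column and the values are assumptions for illustration only.

```python
import csv
import io

# Made-up sample standing in for the real CSV file (assumption)
sample = io.StringIO("宿舍编号,name\nA11-01,Zhang\nA11-02,Li\n")

reader = csv.reader(sample)
header = next(reader)  # consumes the first line, leaving only data rows
records = [dict(zip(header, row)) for row in reader]

print(header)       # ['宿舍编号', 'name']
print(records[0])   # {'宿舍编号': 'A11-01', 'name': 'Zhang'}
```

The standard library's csv.DictReader performs the same header-to-row mapping automatically, so it is a possible alternative when every column is needed.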

Split data

# Split off the rows of the next floor, based on the dormitory-number prefix
def csv_sort():
    global dicts
    dicts = []
    # Removing items from a list while iterating over it skips elements,
    # so iterate over a copy and remove from the original
    for x in dormitory_data.copy():
        dicts.append(x)
        dormitory_data.remove(x)
        # Stop when the data runs out or the next row is on a different floor
        if not dormitory_data or x["宿舍编号"][:3] != dormitory_data[0]["宿舍编号"][:3]:
            break

Don't underestimate these few lines. The algorithm took repeated testing to get right, and there are a few pitfalls in it that caused real headaches, but it was solved in the end.

1. First, we need a rule for splitting the data. Browsing the data shows that the rows for dormitory buildings 1 to 4 in each group belong together, and that the first three characters of the dormitory number change from one floor to the next. So we compare the three-character prefix of each row with that of the next row. When they differ, the next row belongs to a different floor, and that is where the data must be split.

2. After breaking out of the loop, that is, after collecting the dormitory rows for one floor, I was surprised to find that the list being iterated had changed underneath the loop. This is the first pitfall: removing an element from a list shifts the indexes of the elements after it, so a loop that removes items from the list it is iterating over skips elements. I discussed this in my previous article "Computer Level 2 Python Programming: Tricky Knowledge Points". The fix was to iterate over a copy of the list while removing from the original. At moments like this, it really pays to calm down and think slowly.

3. Use this dictionary iteration algorithm to decide where the data needs to be split, and finally wrap it in a function.
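
The pitfall in point 2 is easy to reproduce. Here is a small sketch with made-up room numbers (an assumption, not the real data) that contrasts removing from the list being iterated with removing while iterating over a copy.

```python
rooms = ["A101", "A102", "A103", "A104"]

# Buggy: the list shrinks under the loop, so every second element is skipped
buggy = rooms.copy()
for room in buggy:
    buggy.remove(room)

# Safe: iterate over a copy and remove from the original
safe = rooms.copy()
for room in safe.copy():
    safe.remove(room)

print(buggy)  # ['A102', 'A104']
print(safe)   # []
```

In the buggy loop, removing "A101" moves "A102" into position 0, but the loop has already moved on to position 1, so "A102" is never visited.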

Saving the data

# Save the data: one folder per group letter, one file per building and floor
def keep_data():
    import csv
    import os
    # Column names in their original order (do not sort them, or the column order changes)
    header = list(dormitory_data[0].keys())
    for w in range(65, 73):  # ASCII codes for 'A' to 'H'
        W = chr(w)
        path = '%s栋寝室' % W  # top-level folder
        # Build file paths with os.path.join instead of calling os.chdir,
        # so the 8 folders sit side by side rather than nested in each other
        os.makedirs(path, exist_ok=True)
        for K in range(1, 5):
            for P in range(1, 7):
                csv_sort()
                filename = os.path.join(path, '%s组%d栋%d楼.csv' % (W, K, P))
                # 'w' avoids a duplicated header when the program is run again
                with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
                    writer = csv.DictWriter(f, fieldnames=header)  # maps dictionary keys to columns
                    writer.writeheader()  # write the column names
                    writer.writerows(dicts)  # write the rows for this floor
                print("Wrote {}".format(filename))

This function also has several pitfalls. We need nested for loops to write the data, the os module to create the folders automatically, and descriptive file names so the output is easy to browse. Then there is the encoding of the CSV files. The files are written as UTF-8, but Excel does not assume UTF-8 when it opens a CSV file that lacks a byte order mark (BOM), so the Chinese text appears garbled.

The solution is to write the files with the following encoding, which adds a BOM at the start of the file so that Excel recognizes it as UTF-8:

encoding='utf-8-sig'
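
To see what this encoding changes on disk, the following sketch writes the same made-up header line with both encodings and inspects the first three bytes of each file. The helper name first_bytes is hypothetical.

```python
import codecs
import os
import tempfile

text = "宿舍编号,name\n"  # made-up header line (assumption)


def first_bytes(encoding):
    """Write text with the given encoding and return the file's first 3 bytes."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read(3)
    finally:
        os.remove(path)


plain = first_bytes("utf-8")
with_bom = first_bytes("utf-8-sig")

print(plain == codecs.BOM_UTF8)     # False
print(with_bom == codecs.BOM_UTF8)  # True
```

Reading such a file back with encoding='utf-8-sig' strips the BOM again, which is why the parsing function in this article opens the input with the same encoding.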

Code upgrade

1. We could also generate the finished tables automatically, including title and header information. I will not demonstrate that here; you can explore different solutions yourself.
2. We could draw grid lines on the tables and format them, for example by centering the text, to make them look better.
3. We could write an automatic printing program that connects to the printer and prints the data with one click, which would greatly improve efficiency.

Readers can implement these features themselves; I won't explain them here. After all, writing code and building projects is not easy, hahahaha!

Office automation and one-click processing are among Python's strengths, and we can use them to solve problems in study and daily life. Finally, I want to pay tribute to everyone who spends every day clicking frantically through Excel to sort data. It is heavy, difficult, and tedious work.

One last thing: although designing a program like this is a headache, the result can be reused and continuously improved. In the end, a job that takes others an hour takes you only 3 seconds to run and check!

Program source code

# -*- coding :  utf-8 -*-
# @Time      :  2020/9/15 13:26
# @author    :  王小王
# @Software  :  PyCharm
# @File      :  寝室数据分类.py-1.0版本
# @CSDN      :  https://blog.csdn.net/weixin_47723732

# 1. Parse the CSV data and keep it in memory as a list of dictionaries
def csv_data():
    global dormitory_data
    import csv
    dormitory_data = []
    # Put the CSV file in the same folder as this script
    with open(r"寝室数据.csv", encoding='utf-8-sig', newline='') as file:
        f_csv = csv.reader(file)  # yields each row of the file as a list of strings
        header = next(f_csv)  # take the header row, so f_csv now holds only data rows
        for row in f_csv:
            data = {}
            for index in range(7):  # copy the first 7 columns
                data[header[index]] = row[index]
            dormitory_data.append(data)

# 2. Split off the rows of the next floor, based on the dormitory-number prefix
def csv_sort():
    global dicts
    dicts = []
    # Removing items from a list while iterating over it skips elements,
    # so iterate over a copy and remove from the original
    for x in dormitory_data.copy():
        dicts.append(x)
        dormitory_data.remove(x)
        # Stop when the data runs out or the next row is on a different floor
        if not dormitory_data or x["宿舍编号"][:3] != dormitory_data[0]["宿舍编号"][:3]:
            break

# 3. Save the data: one folder per group letter, one file per building and floor
def keep_data():
    import csv
    import os
    # Column names in their original order (do not sort them, or the column order changes)
    header = list(dormitory_data[0].keys())
    for w in range(65, 73):  # ASCII codes for 'A' to 'H'
        W = chr(w)
        path = '%s栋寝室' % W  # top-level folder
        # Build file paths with os.path.join instead of calling os.chdir,
        # so the 8 folders sit side by side rather than nested in each other
        os.makedirs(path, exist_ok=True)
        for K in range(1, 5):
            for P in range(1, 7):
                csv_sort()
                filename = os.path.join(path, '%s组%d栋%d楼.csv' % (W, K, P))
                # 'w' avoids a duplicated header when the program is run again
                with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
                    writer = csv.DictWriter(f, fieldnames=header)  # maps dictionary keys to columns
                    writer.writeheader()  # write the column names
                    writer.writerows(dicts)  # write the rows for this floor
                print("Wrote {}".format(filename))

def main():
    csv_data()
    keep_data()

if __name__ == '__main__':
    main()






A line for every post

Put what you have learned into practice!

Origin blog.csdn.net/weixin_47723732/article/details/108620540