Analyzing, processing, and screening an activity registration form with Python

        I have had some free time recently. Our department won the bid to host a social mixer, organized together with two other colleges that have a relatively high proportion of female students, and I am responsible for the registration form. From the 13th until now (1 a.m. on the 19th), more than 670 students have signed up, with a male-to-female ratio of roughly 1:2.

        Enough preamble. Because this is real school data, it will not be made public. The registration form was built with Questionnaire Star, and the raw data was downloaded from there as a spreadsheet. The form has 8 questions. The first five are required: name, gender, degree, college, and mobile phone number. The personal photo, the self-introduction, and the requirements for a partner are optional.

        The Excel file downloaded from Questionnaire Star has the following columns:

0 serial number, 1 submission time, 2 time taken, 3 source, 4 source details, 5 IP address, 6 name, 7 gender, 8 degree, 9 college, 10 mobile phone number, 11 personal photo, 12 self-introduction, 13 requirements for a partner

        For the 8 questions: name, college, and mobile phone number are non-empty strings. Gender and degree are numbers giving the index of the selected option. The photo is an attachment: if one was uploaded the cell holds the photo's address, otherwise it holds the string '(空)' (meaning "empty"). The self-introduction and requirements hold the text entered, or '(空)' if left blank.

        The first problem is the college field: some people write an abbreviation and others the full name. For the School of Culture and Journalism, for example, entries include "Wenxin", "Wenxin College", and "Culture and Journalism College". To make the statistics workable, these variants have to be normalized, so I manually compiled a table, collegename.xls:

The first column is the full college name, and the second column holds the keywords that identify that college across the different ways people wrote it. Keywords are separated by spaces, and the keywords of different colleges must not be identical or contain one another.

 

import xlrd

college_name=xlrd.open_workbook(r'collegename.xls')  # open the workbook
cn=college_name.sheet_by_index(0)  # first sheet
dictionary=dict()  # maps each keyword to the full college name
mapping=dict()  # maps each full college name to an index
map_number=0  # next college index
start_row=1  # first data row; row 0 is the header, so skip it
row=start_row
while row<cn.nrows:  # cn.nrows: number of rows in cn
    wordset=cn.cell_value(row,1).split()  # split the keyword cell on whitespace
    for word in wordset:
        dictionary[word]=cn.cell_value(row,0)  # keyword -> full college name
    if cn.cell_value(row,0) not in mapping:
        mapping[cn.cell_value(row,0)]=map_number  # first time this college is seen: give it the next index
        map_number=map_number+1
    row=row+1
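
The rule that keywords of different colleges must not be identical or contain one another is easy to break when the table is maintained by hand. A minimal sketch of a check for it follows. The function name find_keyword_conflicts and the sample colleges are made up for illustration and are not part of the original script.

```python
from itertools import combinations

def find_keyword_conflicts(keywords_by_college):
    """Return (keyword_a, keyword_b) pairs that belong to different colleges
    and where one keyword equals or contains the other."""
    flat = [(kw, college)
            for college, kws in keywords_by_college.items()
            for kw in kws]
    conflicts = []
    for (kw_a, col_a), (kw_b, col_b) in combinations(flat, 2):
        if col_a != col_b and (kw_a in kw_b or kw_b in kw_a):
            conflicts.append((kw_a, kw_b))
    return conflicts

# Hypothetical sample table, not the real collegename.xls
sample = {
    'College of Culture and Journalism': ['Wenxin', 'Culture'],
    'College of Computer Science': ['Computer', 'CS'],
    'College of Agriculture': ['Agri', 'Culture and Farming'],
}
conflicts = find_keyword_conflicts(sample)
print(conflicts)
```

Any pair that is reported would make the substring matching used later ambiguous, so it is worth fixing in the table before the statistics are computed.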

With the keyword-to-college mapping (dictionary) and the college-to-index mapping (mapping) built, count the male and female applicants for each college:

import numpy as np

signtable=xlrd.open_workbook(r'联谊活动报名表.xls')  # raw registration data
st=signtable.sheet_by_index(0)
row=start_row
# One row per college: male count, female count, share of all sign-ups,
# share of all males, share of all females, male-to-female ratio.
# Only the first two columns are filled in here.
statistic=np.zeros([len(mapping),6])
while row<st.nrows:
    college=st.cell_value(row,9)  # college name as the applicant typed it
    find=0  # whether a keyword matched
    for keyword in dictionary:
        if keyword in college:  # a keyword is a substring of the typed name
            find=1
            # gender is 1 or 2, so it selects column 0 (male) or column 1 (female)
            statistic[mapping[dictionary[keyword]],int(st.cell_value(row,7)-1)]+=1
            break
    if find==0:
        print(college+' can not find')
    row=row+1
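
The counting step can also be sketched without any Excel files, which makes the matching logic easier to try out. The version below is an illustration under assumptions, not the code used for the event: the helper names and sample rows are invented, and it tries longer keywords first so that an accidental overlap between keywords resolves to the more specific one.

```python
def match_college(raw_name, keyword_to_college):
    """Return the full college name for a raw entry, or None if no keyword matches.
    Longer keywords are tried first."""
    for keyword in sorted(keyword_to_college, key=len, reverse=True):
        if keyword in raw_name:
            return keyword_to_college[keyword]
    return None

def count_by_gender(rows, keyword_to_college):
    """rows: iterable of (raw_college, gender_code), with 1 = male and 2 = female."""
    counts = {}       # full college name -> [male count, female count]
    unmatched = []    # raw entries that no keyword matched
    for raw_name, gender in rows:
        college = match_college(raw_name, keyword_to_college)
        if college is None:
            unmatched.append(raw_name)
            continue
        counts.setdefault(college, [0, 0])[int(gender) - 1] += 1
    return counts, unmatched

# Hypothetical sample data
keyword_to_college = {
    'Wenxin': 'School of Culture and Journalism',
    'Journalism': 'School of Culture and Journalism',
    'Computer': 'School of Computer Science',
}
rows = [('Wenxin College', 1.0), ('Culture and Journalism College', 2.0),
        ('Computer', 2.0), ('Hogwarts', 1.0)]
counts, unmatched = count_by_gender(rows, keyword_to_college)
print(counts, unmatched)
```

Collecting the unmatched entries in a list, rather than only printing them, makes it simple to go back and add the missing keywords to the table.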

Then draw a pie chart with matplotlib:

import matplotlib.pyplot as plt
labels = [str(i) for i in range((len(mapping)))]
sizes = [(i[0]+i[1])for i in statistic]

plt.pie(sizes, labels=labels, autopct='%1.1f%%',
  shadow=True, startangle=90)
plt.axis('equal') 
plt.show()

result:

Because matplotlib does not render Chinese text by default, each slice is labeled with the college's index from mapping instead of its name.

Next, write the first sheet, which has the following columns:

0 name, 1 time taken to fill in the form (s), 2 college, 3 gender, 4 degree, 5 mobile phone number, 6 photo uploaded or not, 7 introduction, 8 requirements

 

import xlwt
workspace=xlwt.Workbook(encoding='ascii')
excel=workspace.add_sheet('报名表完整版',cell_overwrite_ok=True)  # first sheet: full version of the registration table
# Header row: name, time taken (s), college, gender, degree,
# mobile phone number, photo uploaded, introduction, requirements
excel.write(0,0,'姓名')
excel.write(0,1,'填表时长/s')
excel.write(0,2,'学院')
excel.write(0,3,'性别')
excel.write(0,4,'学历')
excel.write(0,5,'手机号码')
excel.write(0,6,'照片上传情况')
excel.write(0,7,'介绍')
excel.write(0,8,'要求')
for row in range(1,st.nrows):
    excel.write(row,0,st.cell_value(row,6))  # name
    excel.write(row,1,st.cell_value(row,2)[0:-1])  # raw value is 'xxx秒' (seconds); drop the last character to ease later filtering
    temp_college=st.cell_value(row,9)  # college name as typed
    for keyword in dictionary:
        if keyword in temp_college:
            excel.write(row,2,dictionary[keyword])  # full college name looked up in the dictionary
    excel.write(row,3,st.cell_value(row,7))  # gender
    excel.write(row,4,st.cell_value(row,8))  # degree
    excel.write(row,5,st.cell_value(row,10))  # mobile phone number
    if st.cell_value(row,11)=='(空)':  # write 0 if no photo was uploaded, otherwise 1
        excel.write(row,6,0)
    else:
        excel.write(row,6,1)
    excel.write(row,7,st.cell_value(row,12))  # introduction
    excel.write(row,8,st.cell_value(row,13))  # requirements
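
The two per-cell conversions in this loop, stripping the unit from the time and turning the photo cell into a flag, can be isolated into small helpers. This is only a sketch. The helper names are invented, and returning the duration as an int instead of a string is a choice made here, not something the original code does.

```python
EMPTY = '(空)'  # placeholder the export uses for unanswered optional questions

def parse_duration(text):
    """Turn a raw value such as '135秒' into the integer 135 by dropping
    the trailing unit character, as the loop above does."""
    return int(text[:-1])

def photo_flag(cell):
    """Return 0 if the photo cell holds the empty placeholder, otherwise 1."""
    return 0 if cell == EMPTY else 1

print(parse_duration('135秒'), photo_flag(EMPTY), photo_flag('http://example.com/p.jpg'))
```

Storing the duration as a number would let a later filtering step compare it with a threshold directly, without another conversion.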

In the same way, build a second sheet to post in the group chat so that participants can look for a match based on its contents. For that reason the name and mobile phone number are partially masked. To make the sheet easier to read, jieba is used to extract the core words from the introductions and requirements:

import jieba.posseg as posseg
excel2=workspace.add_sheet('报名表简洁版',cell_overwrite_ok=True)  # second sheet: condensed version
# Same header row as the first sheet
excel2.write(0,0,'姓名')
excel2.write(0,1,'填表时长/s')
excel2.write(0,2,'学院')
excel2.write(0,3,'性别')
excel2.write(0,4,'学历')
excel2.write(0,5,'手机号码')
excel2.write(0,6,'照片上传情况')
excel2.write(0,7,'介绍')
excel2.write(0,8,'要求')

word_type=['n','nz','ns','vn','v','a','an']  # part-of-speech tags to keep
for row in range(1,st.nrows):
    temp_name=st.cell_value(row,6)
    if len(temp_name)<=2:
        write_name=temp_name[:1]+'*'
    else:
        write_name=temp_name[0]+'*'*(len(temp_name)-2)+temp_name[-1]
    excel2.write(row,0,write_name)  # keep the first and last character and mask the rest; a 2-character name keeps only the first
    excel2.write(row,1,st.cell_value(row,2)[0:-1])
    temp_college=st.cell_value(row,9)
    for keyword in dictionary:
        if keyword in temp_college:
            excel2.write(row,2,dictionary[keyword])
    excel2.write(row,3,st.cell_value(row,7))
    excel2.write(row,4,st.cell_value(row,8))
    excel2.write(row,5,str(st.cell_value(row,10)[:3])+'****'+str(st.cell_value(row,10)[-4:]))  # mask the middle four digits of the phone number
    if st.cell_value(row,11)=='(空)':
        excel2.write(row,6,0)
    else:
        excel2.write(row,6,1)
    words_info = posseg.cut(st.cell_value(row,12))
    words_demm = posseg.cut(st.cell_value(row,13))  # segment the introduction and requirements with part-of-speech tags
    info=''
    demm=''
    for word, flag in words_info:
        if flag in word_type:
            info+=word  # keep the word only if its part-of-speech tag is in word_type
    for word, flag in words_demm:
        if flag in word_type:
            demm+=word
    excel2.write(row,7,info)
    excel2.write(row,8,demm)
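
The masking rules can be pulled out into two small functions and tried on their own. The sketch below assumes 11-digit mobile numbers and uses invented function names and sample values. It also accepts a phone number that Excel stored as a float, which is an assumption about how the data might look and not something the original code handles.

```python
def mask_name(name):
    """Keep the first and last character and replace the rest with '*'.
    Names of one or two characters keep only the first character."""
    if len(name) <= 2:
        return name[:1] + '*'
    return name[0] + '*' * (len(name) - 2) + name[-1]

def mask_phone(phone):
    """Keep the first three and last four digits; hide the middle with '****'."""
    if isinstance(phone, float):  # numeric cells are read back as floats
        phone = int(phone)
    digits = str(phone)
    return digits[:3] + '****' + digits[-4:]

# Made-up sample values
print(mask_name('张三'), mask_name('张三丰'), mask_name('欧阳娜娜'))
print(mask_phone('13812345678'))
```

For names of two, three, or four characters this gives the same result as the branches in the loop above, and it does not raise an IndexError for names of other lengths.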

Some of the relevant part-of-speech tags are:

a adjective
an adjective used as a noun
d adverb
e interjection
m numeral
n noun
ns place name
nr personal name
nt organization name
nz other proper noun
v verb
vn verb used as a noun
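
The filtering step itself does not depend on jieba. It only needs (word, tag) pairs, which is what iterating over posseg.cut yields in the loop above. A minimal sketch follows, using hand-written tags as a stand-in for real segmenter output. The function name and the sample sentence are invented, and the tags shown are what one would expect, not captured jieba output.

```python
KEEP_FLAGS = {'n', 'nz', 'ns', 'vn', 'v', 'a', 'an'}  # same tags as word_type above

def keep_core_words(tagged_words, keep_flags=KEEP_FLAGS, sep=''):
    """Join the words whose part-of-speech tag is in keep_flags."""
    return sep.join(word for word, flag in tagged_words if flag in keep_flags)

# Hand-written (word, tag) pairs standing in for posseg.cut(...) output
tagged = [('我', 'r'), ('喜欢', 'v'), ('篮球', 'n'), ('和', 'c'), ('电影', 'n')]
summary = keep_core_words(tagged)
spaced = keep_core_words(tagged, sep=' ')
print(summary)
print(spaced)
```

Passing a separator such as a space keeps the retained words visually apart, which may read better in the published sheet than the concatenated form.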

Compute columns 3 to 6 of statistic and write the per-college statistics to the third sheet:

excel3=workspace.add_sheet('数据统计',cell_overwrite_ok=True)  # third sheet: statistics
# Header row: college, sign-ups, share of all sign-ups, male sign-ups,
# share of all males, female sign-ups, share of all females, male-to-female ratio
excel3.write(0,0,'学院名')
excel3.write(0,1,'报名人数')
excel3.write(0,2,'总占比')
excel3.write(0,3,'男生报名数')
excel3.write(0,4,'男生占比')
excel3.write(0,5,'女生报名数')
excel3.write(0,6,'女生占比')
excel3.write(0,7,'男女比例')
number=1
for college in mapping:
    excel3.write(number,0,college)  # college name
    boy=statistic[mapping[college],0]  # male sign-ups for this college, counted earlier
    girl=statistic[mapping[college],1]  # female sign-ups for this college
    print(college+'总报名人数'+str(boy+girl))  # total sign-ups
    excel3.write(number,1,str(boy+girl))  # column 2: total
    excel3.write(number,3,str(boy))  # column 4: male count
    excel3.write(number,5,str(girl))  # column 6: female count
    statistic[mapping[college],2]=round((boy+girl)/(st.nrows-1),5)  # st.nrows includes the header row
    print('占'+str(statistic[mapping[college],2]))
    statistic[mapping[college],3]=round(boy/max(statistic[:,0].sum(),1),3)  # share of all male sign-ups
    statistic[mapping[college],4]=round(girl/max(statistic[:,1].sum(),1),3)  # share of all female sign-ups
    if girl==0:
        girl=1  # avoid dividing by zero in the ratio when a college has no female sign-ups
    statistic[mapping[college],5]=round(boy/girl,3)
    excel3.write(number,2,str(100*statistic[mapping[college],2])+'%')  # write the shares as percentages
    excel3.write(number,4,str(100*statistic[mapping[college],3])+'%')
    excel3.write(number,6,str(100*statistic[mapping[college],4])+'%')
    excel3.write(number,7,str(statistic[mapping[college],5]))
    print('男生占'+str(statistic[mapping[college],3]))  # male share
    print('女生占'+str(statistic[mapping[college],4]))  # female share
    print('男女比例'+str(statistic[mapping[college],5]))  # male-to-female ratio
    number+=1
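
The four derived columns come down to a few divisions, shown here on plain dictionaries instead of the NumPy array. This is a sketch under assumptions: the function name, the dictionary keys, and the sample counts are invented, and the ratio is reported as None when a college has no female sign-ups, which differs from the loop above, where the divisor is replaced by 1.

```python
def college_stats(counts):
    """counts: {college: (boys, girls)} -> {college: shares and ratio}."""
    total_boys = sum(b for b, _ in counts.values())
    total_girls = sum(g for _, g in counts.values())
    total = total_boys + total_girls
    stats = {}
    for college, (boys, girls) in counts.items():
        stats[college] = {
            'total_share': (boys + girls) / total if total else 0.0,
            'boy_share': boys / total_boys if total_boys else 0.0,
            'girl_share': girls / total_girls if total_girls else 0.0,
            'boy_girl_ratio': boys / girls if girls else None,
        }
    return stats

# Made-up counts, not the real registration data
stats = college_stats({'A': (10, 30), 'B': (30, 10), 'C': (0, 0)})
for college, row in stats.items():
    print(college, row)
```

Deriving the denominators from the counts themselves means the shares stay correct as more sign-ups arrive, with no totals to update by hand.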

Finally, save the workbook, which contains the three sheets excel, excel2, and excel3 written above:

workspace.save('报名表.xls')

Complete code:

# -*- coding: utf-8 -*-
"""
Created on Sat Nov 16 17:20:04 2019

@author: 71405
"""

import xlrd
import re
import numpy as np
college_name=xlrd.open_workbook(r'collegename.xls')
cn=college_name.sheet_by_index(0)
dictionary=dict()
mapping=dict()
map_number=0
start_row=1;
row=start_row;
while row<cn.nrows:  # cn.nrows: number of rows
    wordset=cn.cell_value(row,1).split()  # split the keyword cell on whitespace
    for word in wordset:   
        dictionary[word]=cn.cell_value(row,0)
    if cn.cell_value(row,0) not in mapping:
        mapping[cn.cell_value(row,0)]=map_number
        map_number=map_number+1
    row=row+1

signtable=xlrd.open_workbook(r'联谊活动报名表.xls')
st=signtable.sheet_by_index(0)
row=start_row;
statistic=np.zeros([len(mapping),6]);
while row<st.nrows:
    college=st.cell_value(row,9);
    find=0;
    for keyword in dictionary:
        if keyword  in college:
            find=1;
            statistic[mapping[dictionary[keyword]],int(st.cell_value(row,7)-1)]+=1
            break
    if find==0:
        print(college+'can not find')
    row=row+1
    
 
    
    
import matplotlib.pyplot as plt
labels = [str(i) for i in range((len(mapping)))]
sizes = [(i[0]+i[1])for i in statistic]

plt.pie(sizes, labels=labels, autopct='%1.1f%%',
  shadow=True, startangle=90)
plt.axis('equal') 
plt.show()

import jieba.posseg as posseg
import xlwt
workspace=xlwt.Workbook(encoding='ascii')
excel=workspace.add_sheet('报名表完整版',cell_overwrite_ok=True)
excel.write(0,0,'姓名')
excel.write(0,1,'填表时长/s')
excel.write(0,2,'学院')
excel.write(0,3,'性别')
excel.write(0,4,'学历')
excel.write(0,5,'手机号码')
excel.write(0,6,'照片上传情况')
excel.write(0,7,'介绍')
excel.write(0,8,'要求')
for row in range(1,st.nrows):
    excel.write(row,0,st.cell_value(row,6))
    excel.write(row,1,st.cell_value(row,2)[0:-1])
    temp_college=st.cell_value(row,9)
    for keyword in dictionary:
        if keyword  in temp_college:
            excel.write(row,2,dictionary[keyword])
    excel.write(row,3,st.cell_value(row,7))
    excel.write(row,4,st.cell_value(row,8))
    excel.write(row,5,st.cell_value(row,10))
    if st.cell_value(row,11)=='(空)':
        excel.write(row,6,0)
    else:
        excel.write(row,6,1)
    excel.write(row,7,st.cell_value(row,12))
    excel.write(row,8,st.cell_value(row,13))
excel2=workspace.add_sheet('报名表简洁版',cell_overwrite_ok=True)    
excel2.write(0,0,'姓名')
excel2.write(0,1,'填表时长/s')
excel2.write(0,2,'学院')
excel2.write(0,3,'性别')
excel2.write(0,4,'学历')
excel2.write(0,5,'手机号码')
excel2.write(0,6,'照片上传情况')
excel2.write(0,7,'介绍')
excel2.write(0,8,'要求')

word_type=['n','nz','ns','vn','v','a','an']
for row in range(1,st.nrows):
    temp_name=st.cell_value(row,6)
    if len(temp_name)<=2:
        write_name=temp_name[:1]+'*'
    else:
        write_name=temp_name[0]+'*'*(len(temp_name)-2)+temp_name[-1]  # keep first and last character, mask the rest
    excel2.write(row,0,write_name)
    excel2.write(row,1,st.cell_value(row,2)[0:-1])
    temp_college=st.cell_value(row,9)
    for keyword in dictionary:
        if keyword  in temp_college:
            excel2.write(row,2,dictionary[keyword])
    excel2.write(row,3,st.cell_value(row,7))
    excel2.write(row,4,st.cell_value(row,8))
    excel2.write(row,5,str(st.cell_value(row,10)[:3])+'****'+str(st.cell_value(row,10)[-4:]))
    if st.cell_value(row,11)=='(空)':
        excel2.write(row,6,0)
    else:
        excel2.write(row,6,1)
    words_info = posseg.cut(st.cell_value(row,12))
    words_demm = posseg.cut(st.cell_value(row,13))
    info=''
    demm=''
    for word, flag in words_info:
        if flag in word_type:
            info+=word
    for word, flag in words_demm:
        if flag in word_type:
            demm+=word
    excel2.write(row,7,info)
    excel2.write(row,8,demm)
    
excel3=workspace.add_sheet('数据统计',cell_overwrite_ok=True)
excel3.write(0,0,'学院名')
excel3.write(0,1,'报名人数')
excel3.write(0,2,'总占比')
excel3.write(0,3,'男生报名数')
excel3.write(0,4,'男生占比')
excel3.write(0,5,'女生报名数')
excel3.write(0,6,'女生占比')
excel3.write(0,7,'男女比例')
number=1
for college in mapping:
    excel3.write(number,0,college)
    boy=statistic[mapping[college],0];
    girl=statistic[mapping[college],1];
    print(college+'总报名人数'+str(boy+girl))
    excel3.write(number,1,str(boy+girl))
    excel3.write(number,3,str(boy))
    excel3.write(number,5,str(girl))
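    statistic[mapping[college],2]=round((boy+girl)/(st.nrows-1),5)  # st.nrows includes the header row
    print('占'+str(statistic[mapping[college],2]))
    statistic[mapping[college],3]=round(boy/max(statistic[:,0].sum(),1),3)  # share of all male sign-ups
    statistic[mapping[college],4]=round(girl/max(statistic[:,1].sum(),1),3)  # share of all female sign-ups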
    if girl==0:
        girl=1
    statistic[mapping[college],5]=round(boy/girl,3)
    excel3.write(number,2,str(100*statistic[mapping[college],2])+'%')
    excel3.write(number,4,str(100*statistic[mapping[college],3])+'%')
    excel3.write(number,6,str(100*statistic[mapping[college],4])+'%')
    excel3.write(number,7,str(statistic[mapping[college],5]))
    print('男生占'+str(statistic[mapping[college],3]))
    print('女生占'+str(statistic[mapping[college],4]))
    print('男女比例'+str(statistic[mapping[college],5]))   
    number+=1
workspace.save('报名表.xls')

-------------------- dividing line --------------------

After the analysis and processing were done, there were too many applicants to screen one by one, so I wrote a screening script:

Screening is controlled by four conditions: the time taken to fill in the form, whether a photo was uploaded, whether an introduction was written, and whether requirements were written. Together these four dimensions give a rough indication of how enthusiastic and serious a student is about the event. For example, one student finished in 20 s and left all three optional fields empty, which is too careless. Others wrote a lot, uploaded a photo, and spent more than 100 seconds on the form (the longest time in the data is 2600 s, about 43 minutes, which shows real effort!).


import xlrd
import xlwt
workspace=xlwt.Workbook(encoding='ascii')
excel=workspace.add_sheet('报名表筛选版',cell_overwrite_ok=True)  # sheet for the screened registration table
select_list=[1,1,1,1]  # whether to screen on time / photo / introduction / requirements
col_num=[1,6,7,8]  # column indices of those four attributes
time_thre=100  # minimum form-filling time in seconds

table=xlrd.open_workbook(r'报名表.xls')  # the workbook written earlier
t=table.sheet_by_index(0)  # its first sheet (the full version)

remain=1  # next row to write in the output sheet
for i in range(t.ncols):  # t.ncols: number of columns in t
    excel.write(0,i,t.cell_value(0,i))  # as before, copy the header row first

for row in range(1,t.nrows):
    state=True  # whether to keep this row
    if select_list[0]==1:
        if int(t.cell_value(row,col_num[0]))<time_thre:
            state=False  # drop rows filled in faster than the threshold
    for i in range(1,4):
        if select_list[i]==1:  # is the non-empty check enabled for photo / introduction / requirements?
            if t.cell_value(row,col_num[i]) in ('0',0,'(空)','无'):
                state=False  # drop the row if the cell is 0, '(空)' (empty) or '无' (none)
    if state:
        for i in range(t.ncols):
            excel.write(remain,i,t.cell_value(row,i))  # the row passed every check, so copy it to the new sheet
        remain+=1

workspace.save('报名表筛选版.xls')
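
The decision for a single row can be written as one small predicate, which makes the four switches easy to test without any spreadsheet. This is only a sketch. The function and parameter names are invented, and treating an empty string as "not filled in" is an extra assumption that the script above does not make.

```python
EMPTY_VALUES = ('0', 0, '(空)', '无', '')  # '' is an added assumption

def keep_row(seconds, photo, intro, demand,
             min_seconds=100, checks=(True, True, True, True)):
    """Return True if the row passes every enabled check.
    checks switches the time / photo / introduction / requirements filters."""
    if checks[0] and int(seconds) < min_seconds:
        return False
    for enabled, value in zip(checks[1:], (photo, intro, demand)):
        if enabled and value in EMPTY_VALUES:
            return False
    return True

# Made-up rows: (time in seconds, photo flag, introduction, requirements)
print(keep_row('135', 1.0, 'likes hiking', 'kind'))   # careful applicant
print(keep_row('20', 1.0, 'likes hiking', 'kind'))    # filled in too quickly
print(keep_row('135', 0.0, 'likes hiking', 'kind'))   # no photo
```

Turning off one switch, for example checks=(True, False, True, True), keeps applicants who did not upload a photo but otherwise filled in the form carefully.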

 


Origin blog.csdn.net/qq_36614557/article/details/103134597