Hands-on analysis of subway passenger data with NumPy and Pandas (with source code)

If you need the source code and data sets, please like, follow, bookmark, and leave a private message in the comments~~~

First, let's go over the similarities and differences between NumPy and Pandas.

1) NumPy is an extension package for numerical computing that can efficiently process N-dimensional arrays, which makes working with high-dimensional arrays and matrices convenient. Pandas is a Python data analysis package aimed at data manipulation, mainly of two-dimensional tables.

2) A NumPy ndarray can only hold elements of a single type, whereas Pandas can handle mixed types: different columns of a two-dimensional table can have different data types, for example one column of integers and another of strings.

3) NumPy supports fast vectorized computation, and TensorFlow 2.0 and PyTorch tensors can be converted to and from NumPy arrays seamlessly. NumPy's core is written in C, which makes it far more efficient than pure Python code.

4) Pandas is a tool built on top of NumPy, created to solve data analysis tasks. It provides a large number of functions and methods for processing data quickly and conveniently.

5) Pandas and NumPy objects can be converted into each other: df.values turns a DataFrame into an ndarray, and pd.DataFrame(array) turns an ndarray into a DataFrame (a minimal sketch follows below).
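
To make points 2) and 5) concrete, here is a minimal sketch; the column names and values are made up for illustration:

import numpy as np
import pandas as pd

# A DataFrame can mix types across columns: integers next to strings
df = pd.DataFrame({"station_id": [1, 2, 3], "station_name": ["A", "B", "C"]})
print(df.dtypes)

# DataFrame -> ndarray; mixed columns fall back to a common dtype (object here)
arr = df.values
print(type(arr), arr.dtype)

# ndarray -> DataFrame
mat = np.arange(6).reshape(2, 3)
print(pd.DataFrame(mat, columns=["a", "b", "c"]))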

The use of Pandas and NumPy is demonstrated with a case study: extracting three months of passenger volume data for every station at a 15-minute granularity.

Part of the data is shown below:

[Screenshot of the raw Excel data omitted]

A preliminary look at the data reveals the following characteristics:

1. The first six rows of the file are invalid, and the seventh row gives the name of each station;

2. Each station reports passenger flow at a 15-minute granularity, with separate counts for entering the station (进站), exiting the station (出站), and the combined total (进出站);

3. The recorded time range is 2:00-23:59, which differs from the subway's actual operating hours of 5:30-23:00, so the extra rows need to be removed.
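
Points 1 and 3 drive the two preprocessing choices in the code that follows: skip the header rows when reading the Excel file, and drop the rows before 5:30. A minimal loading sketch, assuming one monthly file at a hypothetical path ./data/month1.xls:

import pandas as pd

# Skip the useless header rows at the top and the summary rows at the bottom
raw = pd.read_excel("./data/month1.xls", skiprows = 5, skipfooter = 3)
arr = raw.values

# Row 0 holds the station names; inspecting the time column around index 43
# shows where the 5:30 interval begins (the hard-coded starting row used later)
print(arr[0][:5])
print(arr[40:46, 0])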

Next, get the station names and station indices (the snippet that does this appears at the end of the code below).

Next we define two functions. The goal is to write all the data into two files: "in.csv", which stores each station's inbound data, and "out.csv", which stores each station's outbound data.

def process_not_exists(f):
    # The first five rows are useless data
    file = pd.read_excel(f, skiprows = 5, skipfooter = 3, usecols = target_col)
    arr = file.values
    # Build dictionaries to collect the data first
    d_in = {}
    d_out = {}
    for i in stations_index:
        # Store the inbound/outbound passenger flow of station i
        d_in[i] = []
        d_out[i] = []
    # Data after 5:30 starts at row 50 in the Excel file; in the processed array it starts at index 43
    for i in range(43, len(arr)):
        l = arr[i]  # Get row i
        # These conditions skip the combined '进出站' (total in + out) rows
        if l[1] == '进站':
            # Inbound rows
            for j in range(2, len(l)):
                d_in[j].append(l[j])
        if l[1] == '出站':
            # Outbound rows
            for j in range(2, len(l)):
                d_out[j].append(l[j])
    in_list = []   # Inbound data
    out_list = []  # Outbound data
    for key in d_in:
        # The keys of d_in and d_out are the station indices
        in_list.append(d_in[key])
        out_list.append(d_out[key])

    df_in = pd.DataFrame(in_list)
    df_in.to_csv("./data/in.csv", header = True, index = None)
    df_out = pd.DataFrame(out_list)
    df_out.to_csv("./data/out.csv", header = True, index = None)
# Called when the target files already exist
def process_exists(f, target_file_in, target_file_out):

    infile = pd.read_csv(target_file_in)
    outfile = pd.read_csv(target_file_out)

    in_arr = infile.values.tolist()
    out_arr = outfile.values.tolist()

    # The first five rows are useless data
    file = pd.read_excel(f, skiprows = 5, skipfooter = 3, usecols = target_col)
    arr = file.values
    # Build dictionaries to collect the data first
    d_in = {}
    d_out = {}
    for i in stations_index:
        # Store the inbound/outbound passenger flow of station i
        d_in[i] = []
        d_out[i] = []
    # Data after 5:30 starts at row 50 in the Excel file; in the processed array it starts at index 43
    for i in range(43, len(arr)):
        l = arr[i]  # Get row i
        # These conditions skip the combined '进出站' (total in + out) rows
        if l[1] == '进站':
            # Inbound rows
            for j in range(2, len(l)):
                d_in[j].append(l[j])
        if l[1] == '出站':
            # Outbound rows
            for j in range(2, len(l)):
                d_out[j].append(l[j])
    in_list = []   # Inbound data
    out_list = []  # Outbound data
    for key in d_in:
        # The keys of d_in and d_out are the station indices
        in_list.append(d_in[key])
        out_list.append(d_out[key])

    # Merge with the existing data by appending the new columns
    for i in range(len(in_arr)):
        in_arr[i] += in_list[i]
        out_arr[i] += out_list[i]
    # in_file
    df_in = pd.DataFrame(in_arr)
    df_in.to_csv(target_file_in, header = True, index = None)
    # out_file
    df_out = pd.DataFrame(out_arr)
    df_out.to_csv(target_file_out, header = True, index = None)

The rest of the code is shown below: the main processing loop, followed by the snippets that build stations_index / stations_name and target_col. Those snippets define the globals the functions rely on, so they must run before the loop.

for name in filenames:
    f = "./data/" + name
    target_file_in = "./data/in.csv"
    target_file_out = "./data/out.csv"
    # If the target files already exist, append to them; otherwise create them
    if Path(target_file_in).exists() and Path(target_file_out).exists():
        print("exist")
        process_exists(f, target_file_in, target_file_out)
        #break
    else:
        print("not exist")
        process_not_exists(f)

print("done")
# Get the station names and station indices
nfile = pd.read_excel(f, skiprows = 5, skipfooter = 3, usecols = target_col)
arrt = nfile.values
stations_name = []
stations_index = []
for i in range(2,len(arrt[0])):
    stations_index.append(i)
    stations_name.append(arrt[0][i])
print(stations_name)
print(stations_index)
# Filter out the useless '合计' (total) columns and record the target column indices in target_col
name = filenames[0]
f = "./data/" + name
# The first five rows are useless data
file = pd.read_excel(f, skiprows = 5, skipfooter = 3)
tarr = file.values
print(tarr[3])
test = tarr[0]
target_col = []
for i in range(len(test)):
    tmp = test[i]
    if tmp != '合计':
        target_col.append(i)
print(target_col)
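
After the loop has run, a quick sanity check ties back to the DataFrame/ndarray conversion from point 5): read the generated file back and inspect it as an array (a sketch, assuming in.csv was written under ./data/):

import pandas as pd

df_in = pd.read_csv("./data/in.csv")
arr_in = df_in.values             # DataFrame -> ndarray
print(arr_in.shape)               # rows: stations, columns: 15-minute intervals
print(arr_in[:2, :5])             # first two stations, first five intervals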

This wasn't easy to put together, so if you found it helpful, please like, follow, and bookmark~~~

Origin blog.csdn.net/jiebaoshayebuhui/article/details/130438317