Reading CSV files with a different number of columns per row in pandas

Using pandas to read a CSV file with a different number of columns in each row

In sequence models, the samples do not necessarily have the same length, but a general neural network requires inputs of equal size. One common approach is to take the maximum sample length in the data set as the baseline size and pad the end of every shorter sample with zeros, so that every sample in the data set has the same size.
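
The padding idea can be sketched in plain Python. The helper name pad_to_longest and the sample values are made up for illustration; this is a minimal sketch of the idea, not a fixed recipe.

```python
def pad_to_longest(samples, pad_value=0):
    """Right-pad every sample with pad_value up to the longest sample's length."""
    target_len = max((len(s) for s in samples), default=0)
    return [list(s) + [pad_value] * (target_len - len(s)) for s in samples]


# Hypothetical samples of unequal length
samples = [[1, 2, 3], [4], [5, 6]]
padded = pad_to_longest(samples)
print(padded)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```

The longest sample is left unchanged, and every other sample grows to the same length, so the result can be stacked into a rectangular matrix.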

This article focuses on small-scale CSV data sets: it uses pandas to read CSV files whose rows have different numbers of columns and finally produces data that a neural network can use.

PS: This article is only a general guide. Each specific data set requires its own analysis!

[Figure: sample CSV file in which each row has a different number of columns]

  1. Traverse the train and test files to obtain the maximum number of columns, largest_colum

    train_path = 'train.csv'
    test_path = 'test.csv'
    largest_colum = 0  # largest number of columns in the data set

    with open(train_path, 'r') as f:  # scan train.csv for its widest row
        for line in f:
            # number of comma-separated fields on this line
            largest_colum = max(largest_colum, len(line.split(',')))

    with open(test_path, 'r') as f:  # scan test.csv for its widest row
        for line in f:
            largest_colum = max(largest_colum, len(line.split(',')))
    
  2. Discard the original CSV column index and use largest_colum to generate the column names for reading the CSV files

    import pandas as pd

    col_name = list(range(largest_colum))  # integer name for each CSV column
    train_data = pd.read_csv(train_path, header=None, sep=',', names=col_name, engine='python')
    test_data = pd.read_csv(test_path, header=None, sep=',', names=col_name, engine='python')
    

    After reading, every row has largest_colum columns, and the positions missing from the shorter rows hold NaN.

  3. Fill the missing values at the end of the shorter rows with a padding value. It does not have to be 0, but it must be distinguishable from the original data in the data set; the code below uses -1.

    train_data = train_data.fillna(-1)
    test_data = test_data.fillna(-1)
    
  4. Convert the pandas DataFrame to a torch tensor

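    import torch

    # torch.tensor cannot take a DataFrame directly, so pass its NumPy values
    train_features = torch.tensor(train_data.values, dtype=torch.float32)
    test_features = torch.tensor(test_data.values, dtype=torch.float32)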
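
The four steps above can be tried end to end without any files on disk. The sketch below is only an illustration under stated assumptions: the two CSV strings are invented stand-ins for train.csv and test.csv, the helpers max_columns and read_ragged are hypothetical names, and the result is left as a NumPy float32 array, which torch.tensor or torch.from_numpy can take as input.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for train.csv and test.csv
train_csv = "1,2,3\n4\n5,6\n"
test_csv = "7,8,9,10\n11,12\n"


def max_columns(text):
    """Return the largest number of comma-separated fields on any line."""
    return max((len(line.split(',')) for line in text.splitlines()), default=0)


def read_ragged(text, n_cols, pad_value=-1):
    """Read ragged CSV text into n_cols columns and fill the gaps with pad_value."""
    frame = pd.read_csv(io.StringIO(text), header=None, sep=',',
                        names=list(range(n_cols)), engine='python')
    return frame.fillna(pad_value)


# Step 1: widest row across both data sets
largest_colum = max(max_columns(train_csv), max_columns(test_csv))

# Steps 2 and 3: read with generated column names, then pad with -1
train_data = read_ragged(train_csv, largest_colum)
test_data = read_ragged(test_csv, largest_colum)

# Step 4 (NumPy part): float32 arrays ready to be wrapped in a tensor
train_array = train_data.to_numpy(dtype='float32')
test_array = test_data.to_numpy(dtype='float32')
print(train_array.shape, test_array.shape)  # (3, 4) (2, 4)
```

Computing largest_colum over both files matters: the train sample alone is at most 3 columns wide, but sharing the width of 4 keeps the train and test arrays compatible with the same network input size.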
    

Origin blog.csdn.net/qq_44733706/article/details/130202164