Li Hongyi Machine Learning Homework 2 - Phoneme Classification Prediction

Table of contents

Dataset

Imports

Helper functions

Set seed

Data preprocessing

Dataset loading

Define model

Training function

Read dataset and train

Read dataset

Train

Delete data in memory to save space

Predict

Prediction function

Make predictions

Results

hidden_layers=7, hidden_dim=256

hidden_layers=12, hidden_dim=512

Discussion

Wider or deeper

Effect of learning rate

The impact of batch size


Dataset

Phoneme classification predicts phonemes from speech data. A phoneme is the smallest unit of sound that distinguishes meaning in a human language, and it is a basic concept of phonological analysis. Every language has its own phoneme system.

A frame is a 25 ms audio segment, and consecutive frames are extracted by sliding the window 10 ms at a time. Each frame is processed by MFCC into a vector of length 39. The dataset provides one label per frame vector; there are 41 label classes, each representing a phoneme.

The entire training set is a subset of the train-clean-100 dataset (LibriSpeech), with a total of 2,644,158 frames. After preprocessing, these frames are packed into 4,268 .pt files.

For example, reading 19-198-0008.pt with the load_feat function from the assignment code yields a tensor of shape [284, 39]; the line for 19-198-0008 in train_labels.txt contains exactly 284 numeric labels.
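
As a quick sanity check (a sketch; the paths match those used in the preprocessing code below):

import torch

feat = torch.load('./libriphone/feat/train/19-198-0008.pt')
print(feat.shape)  # torch.Size([284, 39]) -- 284 frames, 39 MFCC dims each

with open('./libriphone/train_labels.txt') as f:
    labels = {parts[0]: parts[1:] for parts in (line.split() for line in f)}
print(len(labels['19-198-0008']))  # 284 -- one label per frame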

Similarly, the test set has a total of 646,268 frames, packed into 1,078 .pt files.

Address: ML2022Spring-hw2 | Kaggle

Imports

import numpy as np
import os
import random
import pandas as pd
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader, TensorDataset
from d2l import torch as d2l

Helper functions

Set seed

# fix random seeds for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    # make cuDNN deterministic (disables the auto-tuner, which can pick non-deterministic kernels)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

Data preprocessing

A phoneme may span several frames, so adjacent frames are concatenated to predict the phoneme of the central frame.

This work is mainly done by the concat_feat function.

# read a .pt feature file
def load_feat(path):
    feat = torch.load(path)
    return feat

# shift the rows of x by n positions, padding with the edge row
def shift(x, n):
    if n < 0:
        left = x[0].repeat(-n, 1)
        right = x[:n]

    elif n > 0:
        right = x[-1].repeat(n, 1)
        left = x[n:]
    else:
        return x

    return torch.cat((left, right), dim=0)
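
# e.g. shift(torch.tensor([[1, 2], [3, 4], [5, 6]]), 1)
#   -> [[3, 4], [5, 6], [5, 6]]   (rows move up; the last row pads the end)
# shift(torch.tensor([[1, 2], [3, 4], [5, 6]]), -1)
#   -> [[1, 2], [1, 2], [3, 4]]   (rows move down; the first row pads the start)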


def concat_feat(x, concat_n):
    assert concat_n % 2 == 1 # n must be odd
    if concat_n < 2:
        return x
    seq_len, feature_dim = x.size(0), x.size(1)
    x = x.repeat(1, concat_n) 
    x = x.view(seq_len, concat_n, feature_dim).permute(1, 0, 2) # concat_n, seq_len, feature_dim
    mid = (concat_n // 2)
    for r_idx in range(1, mid+1):
        x[mid + r_idx, :] = shift(x[mid + r_idx], r_idx)
        x[mid - r_idx, :] = shift(x[mid - r_idx], -r_idx)

    return x.permute(1, 0, 2).view(seq_len, concat_n * feature_dim)

def preprocess_data(split, feat_dir, phone_path, concat_nframes, train_ratio=0.8, train_val_seed=1337):
    class_num = 41 # NOTE: pre-computed, should not need change
    mode = 'train' if (split == 'train' or split == 'val') else 'test'

    label_dict = {}
    if mode != 'test':
      phone_file = open(os.path.join(phone_path, f'{mode}_labels.txt')).readlines()

      for line in phone_file:
          line = line.strip('\n').split(' ')
          label_dict[line[0]] = [int(p) for p in line[1:]]

    if split == 'train' or split == 'val':
        # split training and validation data
        usage_list = open(os.path.join(phone_path, 'train_split.txt')).readlines()  # list of utterance IDs
        random.seed(train_val_seed)  # fix the seed so the train and validation sets never overlap
        random.shuffle(usage_list)
        percent = int(len(usage_list) * train_ratio)
        usage_list = usage_list[:percent] if split == 'train' else usage_list[percent:] # split off the validation set
    elif split == 'test':
        usage_list = open(os.path.join(phone_path, 'test_split.txt')).readlines()
    else:
        raise ValueError('Invalid \'split\' argument for dataset: PhoneDataset!')

    usage_list = [line.strip('\n') for line in usage_list]
    print('[Dataset] - # phone classes: ' + str(class_num) + ', number of utterances for ' + split + ': ' + str(len(usage_list)))

    max_len = 3000000
    # X is the final sample matrix: each row is one sample consisting of concat_nframes frames
    X = torch.empty(max_len, 39 * concat_nframes)  
    if mode != 'test':
      y = torch.empty(max_len, dtype=torch.long)  # labels

    idx = 0
    for i, fname in tqdm(enumerate(usage_list)):
        feat = load_feat(os.path.join(feat_dir, mode, f'{fname}.pt'))  # load one .pt file as a tensor
        cur_len = len(feat)  # number of frames in this utterance, used to trim X below
        feat = concat_feat(feat, concat_nframes)   # expand each frame with its concat_nframes neighbours; edge frames are padded with themselves
        if mode != 'test':
          label = torch.LongTensor(label_dict[fname])  # look up the labels

        X[idx: idx + cur_len, :] = feat  # store into X
        if mode != 'test':
          y[idx: idx + cur_len] = label

        idx += cur_len

    X = X[:idx, :]  # X was allocated with 3,000,000 rows; drop the unused tail
    if mode != 'test':
      y = y[:idx]

    print(f'[INFO] {split} set')
    print(X.shape)
    if mode != 'test':
      print(y.shape)
      return X, y
    else:
      return X

A quick test of the concat_feat function shows roughly what it does:

x = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]])
res = concat_feat(x, 3)
res
tensor([[ 1,  2,  3,  1,  2,  3,  4,  5,  6],
        [ 1,  2,  3,  4,  5,  6,  7,  8,  9],
        [ 4,  5,  6,  7,  8,  9, 10, 11, 12],
        [ 7,  8,  9, 10, 11, 12, 13, 14, 15],
        [10, 11, 12, 13, 14, 15, 16, 17, 18],
        [13, 14, 15, 16, 17, 18, 16, 17, 18]])

Dataset loading

import gc

def loadData(concat_nframes, train_ratio, batch_size):
    # preprocess data
    train_X, train_y = preprocess_data(split='train', feat_dir='./libriphone/feat', phone_path='./libriphone', 
                                       concat_nframes=concat_nframes, train_ratio=train_ratio)
    val_X, val_y = preprocess_data(split='val', feat_dir='./libriphone/feat', phone_path='./libriphone', 
                                   concat_nframes=concat_nframes, train_ratio=train_ratio)

    # get dataset
    train_set = TensorDataset(train_X, train_y)
    val_set = TensorDataset(val_X, val_y)
    print('training set size: {:d}, number of batches: {:.2f}'.format(len(train_set), len(train_set)/ batch_size))
    print('validation set size: {:d}, number of batches: {:.2f}'.format(len(val_set), len(val_set)/ batch_size))
    # remove raw feature to save memory
    del train_X, train_y, val_X, val_y
    gc.collect()

    # get dataloader
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False, drop_last=True)
    
    return train_loader, val_loader

Define model

The model is defined in a modular way: Classifier takes the number of hidden layers as a parameter. Note that the network stacks hidden_layers + 1 BasicBlocks in total: one mapping input_dim to hidden_dim, followed by hidden_layers more of width hidden_dim.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicBlock, self).__init__()

        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.ReLU(),
            nn.BatchNorm1d(output_dim),
            nn.Dropout(0.3)
        )

    def forward(self, x):
        x = self.block(x)
        return x


class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41, hidden_layers=1, hidden_dim=256):
        super(Classifier, self).__init__()

        self.fc = nn.Sequential(
            BasicBlock(input_dim, hidden_dim),
            *[BasicBlock(hidden_dim, hidden_dim) for _ in range(hidden_layers)],
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        x = self.fc(x)
        return x
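
Instantiating a tiny configuration (arbitrary sizes, purely for inspection) confirms the block structure:

net = Classifier(input_dim=39 * 3, hidden_layers=2, hidden_dim=8)
print(net)  # BasicBlock(117->8), two BasicBlock(8->8), then Linear(8->41)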

Training function

The trainer uses the AdamW optimizer with a CosineAnnealingWarmRestarts scheduler to adjust the learning rate, and uses the d2l toolkit from the book Dive into Deep Learning to plot loss and accuracy during training.

def trainer(show_num, train_loader, val_loader, model, config, devices):  
    
    criterion = nn.CrossEntropyLoss() 
    optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 
                                        T_0=2, T_mult=2, eta_min=config['learning_rate']/50)
    n_epochs, best_acc, early_stop_count = config['num_epoch'], 0.0, 0
    num_batches = len(train_loader)
    
    if not os.path.isdir('./models'):
        os.mkdir('./models') # Create directory of saving models.
    legend = ['train loss', 'train acc']
    if val_loader is not None:
        legend.append('valid loss')  
        legend.append('valid acc')  
    animator = d2l.Animator(xlabel='epoch', xlim=[0, n_epochs], legend=legend)       
        
    for epoch in range(n_epochs):
        train_acc, train_loss = 0.0, 0.0
        count = 0
        
        # training
        model.train() # set the model to training mode
        for i, (data, labels) in enumerate(train_loader):
            data, labels = data.to(devices[0]), labels.to(devices[0])         

            optimizer.zero_grad() 
            outputs = model(data) 

            loss = criterion(outputs, labels)
            loss.backward() 
            optimizer.step() 

            _, train_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            train_acc += (train_pred.detach() == labels.detach()).sum().item()
            train_loss += loss.item()
            count += 1
        
            if (i + 1) % (num_batches // show_num) == 0:
                # per-sample averages; drop_last=True guarantees len(data) == batch_size
                train_acc = train_acc / count / len(data)
                train_loss = train_loss / count
                print('train_acc {:.3f}'.format(train_acc))
                animator.add(epoch  + (i + 1) / num_batches, (train_loss, train_acc, None, None)) 
                train_acc, train_loss, count = 0.0, 0.0, 0                
                
        scheduler.step()
        # validation
        if val_loader is not None:
            model.eval() # set the model to evaluation mode
            val_acc, val_loss = 0.0, 0.0  
            with torch.no_grad():
                for i, (data, labels) in enumerate(val_loader):
                    data, labels = data.to(devices[0]), labels.to(devices[0])
                    outputs = model(data)

                    loss = criterion(outputs, labels) 

                    _, val_pred = torch.max(outputs, 1) 
                    val_acc += (val_pred.cpu() == labels.cpu()).sum().item() # get the index of the class with the highest probability
                    val_loss += loss.item()                

                val_acc = val_acc / len(val_loader) / len(data)  # per-sample average; the val loader also drops the last partial batch
                val_loss = val_loss / len(val_loader)
                print('val_acc {:.3f}'.format(val_acc))
                animator.add(epoch + 1, (None, None, val_loss, val_acc))
                
                # if the model improves, save a checkpoint at this epoch
                if val_acc > best_acc:
                    best_acc = val_acc
                    torch.save(model.state_dict(), config['model_path'])
                    # print('saving model with acc {:.3f}'.format(best_acc / len(val_loader) / len(labels)))


    # if not validating, save the last epoch
    if val_loader is None:
        torch.save(model.state_dict(), config['model_path'])
        # print('saving model at last epoch')
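
The warm restarts are what produce the wavy training curves shown later. Here is a minimal standalone sketch (stepping once per epoch, as the trainer does) of how the learning rate evolves with T_0=2, T_mult=2:

import torch

opt = torch.optim.AdamW([torch.zeros(1, requires_grad=True)], lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    opt, T_0=2, T_mult=2, eta_min=1e-3 / 50)
for epoch in range(8):
    print(epoch, opt.param_groups[0]['lr'])  # the lr resets (restarts) at epochs 2 and 6
    sched.step()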

Read dataset and train

Read dataset

concat_nframes is an important parameter. In class the TA suggested setting it to 11; I found that a larger value helps.

Li Hongyi 2022 machine learning homework HW2 record - Zhihu: concat_nframes is worth tuning upward, but it should not be made too large, otherwise the model's accuracy drops.

concat_nframes = 17             # the number of frames to concat with, n must be odd (total 2k+1 = n frames)
train_ratio = 0.9               # the ratio of data used for training, the rest will be used for validation    
batch_size = 8192*4                # batch size
train_loader, val_loader = loadData(concat_nframes, train_ratio, batch_size)

Train

devices = d2l.try_all_gpus()
print(f'DEVICE: {devices}')

# fix random seed
seed = 0                        # random seed
same_seeds(seed)

config = {
    # training prarameters
    'num_epoch': 3,                  # the number of training epoch
    'learning_rate': 1e-3,          # learning rate    

    # model parameters   
    'hidden_layers': 12,               # the number of hidden layers
    'hidden_dim': 256                  # the hidden dim     
}

config['model_path'] = './models/model' + str(config['learning_rate']) + '-' + str(config['hidden_layers']) \
                       + '-' + str(config['hidden_dim']) + '.ckpt'  # the path where the checkpoint will be saved
input_dim = 39 * concat_nframes # the input dim of the model, you should not change the value
model = Classifier(input_dim=input_dim, hidden_layers=config['hidden_layers'], 
                   hidden_dim=config['hidden_dim'])
model = nn.DataParallel(model, device_ids = devices).to(devices[0])

trainer(5, train_loader, val_loader, model, config, devices)

Delete data in memory to save space

del train_loader, val_loader
gc.collect()

Predict

Prediction function

def pred(test_loader, model, devices):
    preds = []

    model.eval()
    with torch.no_grad():
        for batch in tqdm(test_loader):
            features = batch[0].to(devices[0])
            outputs = model(features)
            _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            preds.append(test_pred.cpu())
    preds = torch.cat(preds, dim=0).numpy()
    return preds

Make predictions

# load data
test_X = preprocess_data(split='test', feat_dir='./libriphone/feat', phone_path='./libriphone', concat_nframes=concat_nframes)
test_set = TensorDataset(test_X)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)


# load model
model = Classifier(input_dim=input_dim, hidden_layers=config['hidden_layers'], 
                   hidden_dim=config['hidden_dim'])
model = nn.DataParallel(model, device_ids = devices).to(devices[0])
model.load_state_dict(torch.load(config['model_path']))
preds = pred(test_loader, model, devices)


# output
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(preds):
        f.write('{},{}\n'.format(i, y))

Results

I fixed concat_nframes at 17 and mainly tuned the two parameters hidden_layers and hidden_dim. Reaching the boss baseline requires a more complex model; you can refer to Li Hongyi's 2022 Machine Learning HW2 Analysis_Machine Learning Craftsman's Blog-CSDN Blog_Li Hongyi Machine Learning Homework 2.

hidden_layers=7, hidden_dim=256

Training process: (curves omitted)

Result:

This run was not tuned carefully, yet it already comes close to the medium baseline (0.69747).

hidden_layers=12, hidden_dim=512

I first tried hidden_layers=12, hidden_dim=256, which brought only a limited improvement, so I kept the depth and increased the width to 512.

Obvious waves are visible in the training curves; this is caused by CosineAnnealingWarmRestarts adjusting the learning rate.

The result is about 0.3% short of the strong baseline (0.75028); with further training it should be possible to exceed the strong baseline.

Discussion

Wider or deeper

When scaling up the model there are two options: make it deeper or make it wider.

Li Hongyi 2022 Machine Learning HW2 Analysis_Machine Learning Craftsman's Blog-CSDN Blog_Li Hongyi Machine Learning Homework 2

Using concat_nframes=19, hidden_layers=3, hidden_dim=1024, the model has 3,958,825 parameters and also exceeds the strong baseline.

I tried concat_nframes=17, hidden_layers=12, hidden_dim=512: 3,526,185 parameters and a score about 0.7% lower than the model above. Although the depth grew from 3 to 12 layers, each layer is narrower, so the parameter count is about 11% smaller.
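
Both parameter counts can be reproduced directly from the Classifier definition above with a quick sketch:

def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_params(Classifier(input_dim=39 * 19, hidden_layers=3, hidden_dim=1024)))  # 3958825
print(count_params(Classifier(input_dim=39 * 17, hidden_layers=12, hidden_dim=512)))  # 3526185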

Many articles argue that networks should be deeper rather than wider (see Goodfellow Deep Learning Notes--Neural Network Architecture_iwill323's Blog-CSDN Blog_goodfellow Deep Learning); the comparison above offers some support, since the deeper model reaches a similar score with about 11% fewer parameters.

I also tried a distinctly narrow and deep model, hidden_layers=18, hidden_dim=256, but the result was poor.

The training error is relatively large, which points to an optimization problem: as the model gets deeper, it becomes harder to optimize.

Effect of learning rate

As a novice, I found that trying several learning rates at the start is extremely important, and a few epochs are enough to see the trend. From the runs below, something between 1e-3 and 1e-2 is most appropriate.

Training curves for learning rates 1e-1, 1e-2, 1e-3, 1e-4, and 1e-5 (plots omitted).
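
A sweep like the following sketch makes this comparison cheap (it reuses the trainer above; the epoch count and model size here are arbitrary choices):

for lr in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    sweep_config = {'num_epoch': 2, 'learning_rate': lr,
                    'model_path': f'./models/lr-sweep-{lr}.ckpt'}
    model = Classifier(input_dim=input_dim, hidden_layers=7, hidden_dim=256)
    model = nn.DataParallel(model, device_ids=devices).to(devices[0])
    trainer(5, train_loader, val_loader, model, sweep_config, devices)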

The impact of batch size

The teacher made this point very clear in class, so I simply include the slide (omitted here).

You can refer to:

Goodfellow "flower book" study notes--Optimization for deep models - Programmer Sought

Below I use concat_nframes=5, hidden_layers=1, hidden_dim=64 and run 3 epochs on the CPU:

batch size    time (s)
8             2553
64            449
128           272
256           187

Each time the batch size grows by a factor of n, the time falls by less than n times but more than n/2 times.
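
A quick check of those ratios from the table above:

times = {8: 2553, 64: 449, 128: 272, 256: 187}
print(times[8] / times[64])    # ~5.7x faster for an 8x larger batch
print(times[64] / times[128])  # ~1.65x faster for a 2x larger batch
print(times[128] / times[256]) # ~1.45x faster for a 2x larger batch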

To speed things up I set batch_size = 8192 * 4, which is already very large. At one point I set the batch size even larger and found len(val_loader) == 0 without understanding why; it turned out the batch size exceeded the number of validation samples (264,570), so drop_last=True discarded the only (partial) batch. Choosing an appropriate batch size really matters.
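
A minimal reproduction of that pitfall (hypothetical tensor sizes):

from torch.utils.data import DataLoader, TensorDataset
import torch

ds = TensorDataset(torch.zeros(100, 3))
loader = DataLoader(ds, batch_size=128, shuffle=False, drop_last=True)
print(len(loader))  # 0 -- the single partial batch is dropped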

  

Origin blog.csdn.net/iwill323/article/details/127812090