data set
The task is phoneme classification: predicting phonemes from speech data. A phoneme is the smallest unit of sound that can distinguish meaning in a human language, and it is the basic concept of phonological analysis. Every language has its own phoneme system.
Each frame is a 25 ms audio segment, and consecutive frames are obtained by sliding the window 10 ms at a time. Each frame is processed by MFCC into a vector of length 39. The dataset provides one label per frame vector. There are 41 label classes, each representing a phoneme.
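As a rough illustration of the framing described above, the number of frames in a clip can be estimated from its duration. This is only a sketch: it assumes no padding at the clip edges, which the dataset description does not specify, and `num_frames` is a hypothetical helper that is not part of the homework code.

```python
def num_frames(duration_ms, win_ms=25, hop_ms=10):
    """Count the full windows of win_ms that fit when sliding by hop_ms (no padding assumed)."""
    if duration_ms < win_ms:
        return 0
    return 1 + (duration_ms - win_ms) // hop_ms

# Under this assumption, a clip of 2855 ms yields 284 frames,
# the same length as the 19-198-0008 example in this post.
print(num_frames(2855))
```

Because the hop (10 ms) is shorter than the window (25 ms), neighbouring frames overlap by 15 ms.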
The training set is a subset of the LibriSpeech train-clean-100 dataset, with a total of 2644158 frames. After preprocessing, these frames are stored in 4268 .pt files.
For example, reading 19-198-0008.pt with the load_feat function from the homework code gives a tensor of shape [284, 39], and the line for 19-198-0008 in train_labels.txt contains 284 numeric labels.
Similarly, the test set has a total of 646268 frames, stored in 1078 .pt files.
Competition page: ML2022Spring-hw2 | Kaggle
Import packages
import numpy as np
import os
import random
import pandas as pd
import torch
from tqdm import tqdm
from torch.utils.data import Dataset
from torch.utils.data import DataLoader, TensorDataset
from d2l import torch as d2l
helper function
set seed
# fix seed
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
data preprocessing
A phoneme may span multiple frames, so each frame is concatenated with its neighbouring frames, and the combined vector is used to predict the phoneme of the central frame.
This is mainly done by the concat_feat function.
# read a .pt file
def load_feat(path):
    feat = torch.load(path)
    return feat

def shift(x, n):
    if n < 0:
        left = x[0].repeat(-n, 1)
        right = x[:n]
    elif n > 0:
        right = x[-1].repeat(n, 1)
        left = x[n:]
    else:
        return x
    return torch.cat((left, right), dim=0)

def concat_feat(x, concat_n):
    assert concat_n % 2 == 1 # n must be odd
    if concat_n < 2:
        return x
    seq_len, feature_dim = x.size(0), x.size(1)
    x = x.repeat(1, concat_n)
    x = x.view(seq_len, concat_n, feature_dim).permute(1, 0, 2) # concat_n, seq_len, feature_dim
    mid = (concat_n // 2)
    for r_idx in range(1, mid+1):
        x[mid + r_idx, :] = shift(x[mid + r_idx], r_idx)
        x[mid - r_idx, :] = shift(x[mid - r_idx], -r_idx)
    return x.permute(1, 0, 2).view(seq_len, concat_n * feature_dim)
def preprocess_data(split, feat_dir, phone_path, concat_nframes, train_ratio=0.8, train_val_seed=1337):
    class_num = 41 # NOTE: pre-computed, should not need change
    mode = 'train' if (split == 'train' or split == 'val') else 'test'
    label_dict = {}
    if mode != 'test':
        phone_file = open(os.path.join(phone_path, f'{mode}_labels.txt')).readlines()
        for line in phone_file:
            line = line.strip('\n').split(' ')
            label_dict[line[0]] = [int(p) for p in line[1:]]
    if split == 'train' or split == 'val':
        # split training and validation data
        usage_list = open(os.path.join(phone_path, 'train_split.txt')).readlines() # list of utterance ids
        random.seed(train_val_seed) # fix the seed so that the training and validation splits do not overlap
        random.shuffle(usage_list)
        percent = int(len(usage_list) * train_ratio)
        usage_list = usage_list[:percent] if split == 'train' else usage_list[percent:] # the rest is the validation set
    elif split == 'test':
        usage_list = open(os.path.join(phone_path, 'test_split.txt')).readlines()
    else:
        raise ValueError('Invalid \'split\' argument for dataset: PhoneDataset!')
    usage_list = [line.strip('\n') for line in usage_list]
    print('[Dataset] - # phone classes: ' + str(class_num) + ', number of utterances for ' + split + ': ' + str(len(usage_list)))
    max_len = 3000000
    # X holds the final samples, one per row; each row contains concat_nframes frames
    X = torch.empty(max_len, 39 * concat_nframes)
    if mode != 'test':
        y = torch.empty(max_len, dtype=torch.long) # labels
    idx = 0
    for i, fname in tqdm(enumerate(usage_list)):
        feat = load_feat(os.path.join(feat_dir, mode, f'{fname}.pt')) # read one .pt file into a tensor
        cur_len = len(feat) # number of rows in this tensor, used below to fill and truncate X
        # each frame becomes the centre of a window of concat_nframes frames;
        # at the edges, the edge frame itself stands in for the missing neighbours
        feat = concat_feat(feat, concat_nframes)
        if mode != 'test':
            label = torch.LongTensor(label_dict[fname]) # labels of this utterance
        X[idx: idx + cur_len, :] = feat # store in X
        if mode != 'test':
            y[idx: idx + cur_len] = label
        idx += cur_len
    X = X[:idx, :] # X was allocated with 3000000 rows; drop the unused part
    if mode != 'test':
        y = y[:idx]
    print(f'[INFO] {split} set')
    print(X.shape)
    if mode != 'test':
        print(y.shape)
        return X, y
    else:
        return X
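The buffer X in preprocess_data is allocated with max_len = 3000000 rows before any file is read, so its requested size grows linearly with concat_nframes. The arithmetic can be sketched as follows, assuming the default float32 dtype (4 bytes per value); `buffer_bytes` is a hypothetical helper, and how much of the buffer is actually resident in memory depends on the platform.

```python
def buffer_bytes(concat_nframes, max_len=3_000_000, feature_dim=39, bytes_per_value=4):
    """Requested size of torch.empty(max_len, 39 * concat_nframes) in float32."""
    return max_len * feature_dim * concat_nframes * bytes_per_value

for n in (1, 11, 17):
    gib = buffer_bytes(n) / 2**30
    print(f'concat_nframes={n}: {gib:.2f} GiB')
```

With concat_nframes=17 the request is about 7.4 GiB, which is one reason large windows can run out of memory on small machines.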
Testing concat_feat on a small tensor shows roughly what the code above does:
x = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]])
res = concat_feat(x, 3)
res
tensor([[ 1,  2,  3,  1,  2,  3,  4,  5,  6],
        [ 1,  2,  3,  4,  5,  6,  7,  8,  9],
        [ 4,  5,  6,  7,  8,  9, 10, 11, 12],
        [ 7,  8,  9, 10, 11, 12, 13, 14, 15],
        [10, 11, 12, 13, 14, 15, 16, 17, 18],
        [13, 14, 15, 16, 17, 18, 16, 17, 18]])
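Another way to read this output: for each frame i, concat_feat collects the frames i-k, ..., i, ..., i+k, and any index that falls outside the utterance is clamped to the first or last frame. The sketch below reproduces the result with plain Python lists. `context_indices` and `concat_rows` are hypothetical names used only for illustration, and this is not how the homework code is implemented.

```python
def context_indices(seq_len, concat_n):
    """For each frame i, the frame indices of its window, clamped to [0, seq_len - 1]."""
    assert concat_n % 2 == 1  # the window must be odd, as in concat_feat
    mid = concat_n // 2
    return [[min(max(i + k, 0), seq_len - 1) for k in range(-mid, mid + 1)]
            for i in range(seq_len)]

def concat_rows(rows, concat_n):
    """Concatenate each frame with its clamped neighbours."""
    return [[v for j in idx for v in rows[j]]
            for idx in context_indices(len(rows), concat_n)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]]
res = concat_rows(x, 3)
print(res[0])   # [1, 2, 3, 1, 2, 3, 4, 5, 6]
print(res[-1])  # [13, 14, 15, 16, 17, 18, 16, 17, 18]
```

The first and last rows show the edge handling: the missing neighbour is replaced by the edge frame itself.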
Dataset loading
import gc
def loadData(concat_nframes, train_ratio, batch_size):
    # preprocess data
    train_X, train_y = preprocess_data(split='train', feat_dir='./libriphone/feat', phone_path='./libriphone',
                                       concat_nframes=concat_nframes, train_ratio=train_ratio)
    val_X, val_y = preprocess_data(split='val', feat_dir='./libriphone/feat', phone_path='./libriphone',
                                   concat_nframes=concat_nframes, train_ratio=train_ratio)
    # get dataset
    train_set = TensorDataset(train_X, train_y)
    val_set = TensorDataset(val_X, val_y)
    print('training set size: {:d}, number of batches: {:.2f}'.format(len(train_set), len(train_set)/ batch_size))
    print('validation set size: {:d}, number of batches: {:.2f}'.format(len(val_set), len(val_set)/ batch_size))
    # remove raw feature to save memory
    del train_X, train_y, val_X, val_y
    gc.collect()
    # get dataloader
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False, drop_last=True)
    return train_loader, val_loader
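loadData calls preprocess_data twice, once for 'train' and once for 'val'. The two calls produce non-overlapping utterance lists only because each call reseeds the generator before shuffling, so both see the same order. A minimal sketch of that logic, using made-up utterance ids and a hypothetical helper `split_utterances`:

```python
import random

def split_utterances(ids, split, train_ratio=0.8, seed=1337):
    """Shuffle with a fixed seed, then take the head (train) or the tail (val)."""
    ids = list(ids)
    random.seed(seed)       # same seed on every call, so the shuffled order is identical
    random.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut] if split == 'train' else ids[cut:]

ids = [f'utt-{i:04d}' for i in range(10)]   # hypothetical ids, not real file names
train_ids = split_utterances(ids, 'train', train_ratio=0.9)
val_ids = split_utterances(ids, 'val', train_ratio=0.9)
print(len(train_ids), len(val_ids), set(train_ids) & set(val_ids))
```

If the seed differed between the two calls, some utterances could land in both sets and the validation accuracy would be overly optimistic.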
define model
The model is defined modularly, and Classifier lets you specify the number of hidden layers.
import torch
import torch.nn as nn
import torch.nn.functional as F
class BasicBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicBlock, self).__init__()
        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.ReLU(),
            nn.BatchNorm1d(output_dim),
            nn.Dropout(0.3)
        )

    def forward(self, x):
        x = self.block(x)
        return x

class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41, hidden_layers=1, hidden_dim=256):
        super(Classifier, self).__init__()
        self.fc = nn.Sequential(
            BasicBlock(input_dim, hidden_dim),
            *[BasicBlock(hidden_dim, hidden_dim) for _ in range(hidden_layers)],
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        x = self.fc(x)
        return x
training function
The optimizer is AdamW, and CosineAnnealingWarmRestarts adjusts the learning rate. The d2l toolkit from the book "Dive into Deep Learning" is used to plot the loss and accuracy during training.
def trainer(show_num, train_loader, val_loader, model, config, devices):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer,
                T_0=2, T_mult=2, eta_min=config['learning_rate']/50)
    n_epochs, best_acc, early_stop_count = config['num_epoch'], 0.0, 0
    num_batches = len(train_loader)
    if not os.path.isdir('./models'):
        os.mkdir('./models') # Create directory of saving models.
    legend = ['train loss', 'train acc']
    if val_loader is not None:
        legend.append('valid loss')
        legend.append('valid acc')
    animator = d2l.Animator(xlabel='epoch', xlim=[0, n_epochs], legend=legend)
    for epoch in range(n_epochs):
        train_acc, train_loss = 0.0, 0.0
        count = 0
        # training
        model.train() # set the model to training mode
        for i, (data, labels) in enumerate(train_loader):
            data, labels = data.to(devices[0]), labels.to(devices[0])
            optimizer.zero_grad()
            outputs = model(data)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            _, train_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            train_acc += (train_pred.detach() == labels.detach()).sum().item()
            train_loss += loss.item()
            count += 1
            # plot show_num points per epoch; max(1, ...) avoids a modulo by zero when there are few batches
            if (i + 1) % max(1, num_batches // show_num) == 0:
                train_acc = train_acc / count / len(data)
                train_loss = train_loss / count
                print('train_acc {:.3f}'.format(train_acc))
                animator.add(epoch + (i + 1) / num_batches, (train_loss, train_acc, None, None))
                train_acc, train_loss, count = 0.0, 0.0, 0
        scheduler.step()
        # validation
        if val_loader is not None:
            model.eval() # set the model to evaluation mode
            val_acc, val_loss = 0.0, 0.0
            with torch.no_grad():
                for i, (data, labels) in enumerate(val_loader):
                    data, labels = data.to(devices[0]), labels.to(devices[0])
                    outputs = model(data)
                    loss = criterion(outputs, labels)
                    _, val_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
                    val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
                    val_loss += loss.item()
            val_acc = val_acc / len(val_loader) / len(data)
            val_loss = val_loss / len(val_loader)
            print('val_acc {:.3f}'.format(val_acc))
            animator.add(epoch + 1, (None, None, val_loss, val_acc))
            # if the model improves, save a checkpoint at this epoch
            if val_acc > best_acc:
                best_acc = val_acc
                torch.save(model.state_dict(), config['model_path'])
                # print('saving model with acc {:.3f}'.format(best_acc / len(val_loader) / len(labels)))
    # if not validating, save the last epoch
    if val_loader is None:
        torch.save(model.state_dict(), config['model_path'])
        # print('saving model at last epoch')
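To see what learning rates the scheduler settings in trainer produce (T_0=2, T_mult=2, eta_min equal to the base rate divided by 50), the cosine-annealing-with-warm-restarts formula can be evaluated by hand. This is a sketch that assumes scheduler.step() is called once per epoch; `warm_restart_lr` is a hypothetical helper written for illustration and is not a PyTorch function.

```python
import math

def warm_restart_lr(epoch, base_lr=1e-3, T_0=2, T_mult=2, eta_min=None):
    """Learning rate after `epoch` scheduler steps under cosine annealing with warm restarts."""
    if eta_min is None:
        eta_min = base_lr / 50          # same ratio as in trainer
    T_i, T_cur = T_0, epoch
    while T_cur >= T_i:                 # skip completed cycles; each cycle is T_mult times longer
        T_cur -= T_i
        T_i *= T_mult
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * T_cur / T_i)) / 2

for epoch in range(8):
    print(epoch, f'{warm_restart_lr(epoch):.2e}')
```

Under these settings the rate jumps back to the base value at epochs 0, 2, 6, 14, and so on, and decays along a cosine curve in between.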
Read dataset and train
read dataset
concat_nframes is an important parameter. In class, the teaching assistant mentioned that it can be set to 11. From what I have seen, setting it somewhat larger helps.
Li Hongyi 2022 machine learning homework HW2 record - Zhihu: concat_nframes is worth tuning, but it should not be set too large, because a value that is too large reduces the accuracy of the model.
concat_nframes = 17 # the number of frames to concat with, n must be odd (total 2k+1 = n frames)
train_ratio = 0.9 # the ratio of data used for training, the rest will be used for validation
batch_size = 8192*4 # batch size
train_loader, val_loader = loadData(concat_nframes, train_ratio, batch_size)
train
devices = d2l.try_all_gpus()
print(f'DEVICE: {devices}')
# fix random seed
seed = 0 # random seed
same_seeds(seed)
config = {
    # training parameters
    'num_epoch': 3, # the number of training epoch
    'learning_rate': 1e-3, # learning rate
    # model parameters
    'hidden_layers': 12, # the number of hidden layers
    'hidden_dim': 256 # the hidden dim
}
# the path where the checkpoint will be saved
config['model_path'] = ('./models/model' + str(config['learning_rate']) + '-' + str(config['hidden_layers'])
                        + '-' + str(config['hidden_dim']) + '.ckpt')
input_dim = 39 * concat_nframes # the input dim of the model, you should not change the value
model = Classifier(input_dim=input_dim, hidden_layers=config['hidden_layers'],
hidden_dim=config['hidden_dim'])
model = nn.DataParallel(model, device_ids = devices).to(devices[0])
trainer(5, train_loader, val_loader, model, config, devices)
Delete data in memory to save space
del train_loader, val_loader
gc.collect()
predict
prediction function
def pred(test_loader, model, devices):
    test_acc = 0.0
    test_lengths = 0
    preds = []
    model.eval()
    with torch.no_grad():
        for batch in tqdm(test_loader):
            features = batch[0].to(devices[0])
            outputs = model(features)
            _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            preds.append(test_pred.cpu())
    preds = torch.cat(preds, dim=0).numpy() # concatenate the per-batch predictions
    return preds
make predictions
# load data
test_X = preprocess_data(split='test', feat_dir='./libriphone/feat', phone_path='./libriphone', concat_nframes=concat_nframes)
test_set = TensorDataset(test_X)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
# load model
model = Classifier(input_dim=input_dim, hidden_layers=config['hidden_layers'],
hidden_dim=config['hidden_dim'])
model = nn.DataParallel(model, device_ids = devices).to(devices[0])
model.load_state_dict(torch.load(config['model_path']))
preds = pred(test_loader, model, devices) # use a new name so the pred function is not overwritten
# output
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(preds):
        f.write('{},{}\n'.format(i, y))
answer
I fixed concat_nframes at 17 and mainly tuned the two parameters hidden_layers and hidden_dim. Reaching the boss baseline requires a more complex model. You can refer to Li Hongyi's 2022 Machine Learning HW2 Analysis_Machine Learning Craftsman's Blog-CSDN Blog_Li Hongyi Machine Learning Homework 2
hidden_layers=7,hidden_dim=256
Training process:
result:
This run was not tuned carefully, yet it is already relatively close to the Medium baseline: 0.69747
hidden_layers=12,hidden_dim=512
I tried hidden_layers=12, hidden_dim=256, and the improvement was limited, so I widened every layer to 512.
The training curves show obvious waves. This is caused by using CosineAnnealingWarmRestarts to adjust the learning rate.
The result is about 0.3% below the Strong baseline (0.75028). With further training, it should be able to exceed the Strong baseline.
discuss
wider or deeper
When scaling the model, there are two options: make it deeper or make it wider.
Using concat_nframes=19, hidden_layers=3, hidden_dim=1024, the model has 3,958,825 parameters, and it also exceeds the Strong baseline.
With concat_nframes=17, hidden_layers=12, hidden_dim=512, the model has 3,526,185 parameters, and its score is 0.7% lower than that of the model above. The number of layers goes from 3 to 12, but because each layer is narrower, the number of parameters is 11% smaller.
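The two parameter counts can be checked by hand from the Classifier definition: it contains 1 + hidden_layers BasicBlocks (each an nn.Linear with bias plus an nn.BatchNorm1d with a scale and a shift per feature) followed by a final nn.Linear. The sketch below does this arithmetic in plain Python; `count_params` is a hypothetical helper, not part of the original code.

```python
def count_params(concat_nframes, hidden_layers, hidden_dim, feature_dim=39, output_dim=41):
    """Trainable parameters of Classifier: (1 + hidden_layers) BasicBlocks plus the output layer."""
    def block(in_dim, out_dim):
        linear = in_dim * out_dim + out_dim   # weight + bias of nn.Linear
        batchnorm = 2 * out_dim               # scale + shift of nn.BatchNorm1d
        return linear + batchnorm

    input_dim = feature_dim * concat_nframes
    total = block(input_dim, hidden_dim)                      # first block
    total += hidden_layers * block(hidden_dim, hidden_dim)    # remaining blocks
    total += hidden_dim * output_dim + output_dim             # final nn.Linear
    return total

wide = count_params(19, hidden_layers=3, hidden_dim=1024)
deep = count_params(17, hidden_layers=12, hidden_dim=512)
print(wide, deep, f'{1 - deep / wide:.1%}')
```

This reproduces 3,958,825 and 3,526,185, a difference of roughly 11%. Part of that difference comes from the smaller input (17 frames instead of 19), not only from the narrower layers.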
Many articles argue that a network should be made deeper rather than wider (you can refer to Goodfellow Deep Learning Notes--Neural Network Architecture_iwill323's Blog-CSDN Blog_goodfellow Deep Learning). The comparison above can be read as some support for this: with 11% fewer parameters, the deeper model loses only 0.7%.
I also tried a much narrower and deeper model, hidden_layers=18, hidden_dim=256, but the result was not good.
The training error is relatively large, which means the optimization is not good enough. This may be because a deeper model is harder to optimize.
Effect of learning rate
As a novice, I find it very important to try several learning rates at the start. The effect is already visible after running a few epochs. As the following shows, 1e-3 to 1e-2 is the most appropriate range.
[Training curves for learning rates 1e-1, 1e-2, 1e-3, 1e-4, and 1e-5]
The impact of batch size
The teacher explained this topic very clearly in class, so I show the slide directly:
You can refer to:
Goodfellow Flower Tree Study Notes--Optimization in Depth Model - Programmer Sought
Below I use concat_nframes=5, hidden_layers=1, hidden_dim=64, and run 3 epochs on the CPU for each batch size.
batch size | time (s) |
8 | 2553 |
64 | 449 |
128 | 272 |
256 | 187 |
When the batch size is increased n times, the running time shrinks by a factor that is smaller than n but larger than n/2.
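The claim can be checked directly against the timings in the table. The following sketch computes, for each pair of neighbouring rows, how many times larger the batch is and how many times shorter the run is:

```python
times = {8: 2553, 64: 449, 128: 272, 256: 187}  # batch size -> seconds, from the table above

sizes = sorted(times)
checks = []
for small, large in zip(sizes, sizes[1:]):
    n = large / small                       # how many times larger the batch is
    speedup = times[small] / times[large]   # how many times shorter the run is
    checks.append(n / 2 < speedup < n)
    print(f'{small} -> {large}: batch x{n:g}, time /{speedup:.2f}')

print(all(checks))
```

For example, going from 8 to 64 makes the batch 8 times larger and the run about 5.7 times shorter, which lies between 4 and 8.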
To train faster, I use batch_size = 8192 * 4, which is considered very large. I once set the batch size even larger and found that len(val_loader) was 0. At first I did not understand why. It turned out that the batch size was larger than 264570, the number of validation samples, so the only batch was incomplete and was discarded by drop_last. In practice, it is very important to find an appropriate batch size.
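The empty validation loader follows from how the number of batches is computed when drop_last=True. A small sketch of that arithmetic, where `num_batches` is a hypothetical helper that mirrors what len(loader) returns, and 300000 is just an example of a batch size larger than the validation set:

```python
import math

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a DataLoader yields for a dataset of num_samples items."""
    if drop_last:
        return num_samples // batch_size           # the incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)     # the incomplete final batch is kept

val_samples = 264570  # validation-set size mentioned in this post
print(num_batches(val_samples, 8192 * 4, drop_last=True))   # 8
print(num_batches(val_samples, 300000, drop_last=True))     # 0
print(num_batches(val_samples, 300000, drop_last=False))    # 1
```

Setting drop_last=False for the validation loader would avoid the empty loader, at the cost of a smaller final batch.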