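Each 25 ms audio segment is treated as one frame, and frames are cut by sliding the window 10 ms at a time. Each frame is processed with MFCC into a vector of length 39. The dataset provides a label for every frame vector. There are 41 label classes, each representing a phoneme.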
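
To make the framing arithmetic concrete, here is a minimal sketch of how many 25 ms frames a clip yields with a 10 ms hop. The helper names `num_frames` and `frame_bounds` are hypothetical, and the sketch assumes no padding at the end of the clip, which the post does not state.

```python
def num_frames(duration_ms: int, win_ms: int = 25, hop_ms: int = 10) -> int:
    """Number of full windows that fit in the clip (assumes no padding)."""
    if duration_ms < win_ms:
        return 0
    return 1 + (duration_ms - win_ms) // hop_ms


def frame_bounds(index: int, win_ms: int = 25, hop_ms: int = 10) -> tuple:
    """Start and end time (in ms) of the frame with the given index."""
    start = index * hop_ms
    return start, start + win_ms


print(num_frames(1000))   # a 1-second clip gives 98 frames
print(frame_bounds(2))    # the third frame covers 20 ms to 45 ms
```

Because consecutive frames overlap by 15 ms, neighbouring frames usually carry the same phoneme, which is why the later preprocessing step concatenates adjacent frames.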

The whole training set is a subset of the train-clean-100 dataset (LibriSpeech) and contains 2644158 frames in total. After preprocessing, these frames are stored in 4268 .pt files.

For example, reading 19-198-0008.pt with the load_feat function from the homework code yields a tensor of shape [284, 39]. The line that starts with 19-198-0008 in train_labels.txt holds 284 numeric labels in total, one per frame.
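
As a rough illustration of the label file layout described above, the sketch below parses one line into an utterance ID and a list of integer labels. The sample line is made up and shortened; only the "ID followed by space-separated integers" layout comes from the homework code, and `parse_label_line` is a hypothetical helper name.

```python
def parse_label_line(line: str):
    """Split one line of train_labels.txt into (utterance_id, labels)."""
    parts = line.strip().split(' ')
    return parts[0], [int(p) for p in parts[1:]]


# Made-up labels for illustration; the real line holds 284 values.
sample = "19-198-0008 0 0 0 25 25 14\n"
utt_id, labels = parse_label_line(sample)
print(utt_id, len(labels), labels)
```

For a real file, the number of labels on a line should equal the first dimension of the matching feature tensor (284 in the example above).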

Likewise, the test set contains 646268 frames in total, stored in 1078 .pt files.

Link: ML2022Spring-hw2 | Kaggle

Import packages

import numpy as np
import os
import random
import pandas as pd
import torch
from tqdm import tqdm
from torch.utils.data import Dataset
from torch.utils.data import DataLoader, TensorDataset
from d2l import torch as d2l

Helper functions

Set the seed

#fix seed
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  
    np.random.seed(seed)  
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

Data preprocessing

A phoneme can span several frames, so to predict the phoneme of the centre frame we concatenate it with its neighbouring frames.

This work is done mainly by the concat_feat function.

# Load a .pt file
def load_feat(path):
    feat = torch.load(path)
    return feat

def shift(x, n):
    if n < 0:
        left = x[0].repeat(-n, 1)
        right = x[:n]

    elif n > 0:
        right = x[-1].repeat(n, 1)
        left = x[n:]
    else:
        return x

    return torch.cat((left, right), dim=0)


def concat_feat(x, concat_n):
    assert concat_n % 2 == 1 # n must be odd
    if concat_n < 2:
        return x
    seq_len, feature_dim = x.size(0), x.size(1)
    x = x.repeat(1, concat_n) 
    x = x.view(seq_len, concat_n, feature_dim).permute(1, 0, 2) # concat_n, seq_len, feature_dim
    mid = (concat_n // 2)
    for r_idx in range(1, mid+1):
        x[mid + r_idx, :] = shift(x[mid + r_idx], r_idx)
        x[mid - r_idx, :] = shift(x[mid - r_idx], -r_idx)

    return x.permute(1, 0, 2).view(seq_len, concat_n * feature_dim)

def preprocess_data(split, feat_dir, phone_path, concat_nframes, train_ratio=0.8, train_val_seed=1337):
    class_num = 41 # NOTE: pre-computed, should not need change
    mode = 'train' if (split == 'train' or split == 'val') else 'test'

    label_dict = {}
    if mode != 'test':
      phone_file = open(os.path.join(phone_path, f'{mode}_labels.txt')).readlines()

      for line in phone_file:
          line = line.strip('\n').split(' ')
          label_dict[line[0]] = [int(p) for p in line[1:]]

    if split == 'train' or split == 'val':
        # split training and validation data
        usage_list = open(os.path.join(phone_path, 'train_split.txt')).readlines()  # list of utterance IDs
        random.seed(train_val_seed)  # fix the seed so every call shuffles identically and the train/val splits never overlap
        random.shuffle(usage_list)
        percent = int(len(usage_list) * train_ratio)
        usage_list = usage_list[:percent] if split == 'train' else usage_list[percent:] # the remainder is the validation set
    elif split == 'test':
        usage_list = open(os.path.join(phone_path, 'test_split.txt')).readlines()
    else:
        raise ValueError('Invalid \'split\' argument for dataset: PhoneDataset!')

    usage_list = [line.strip('\n') for line in usage_list]
    print('[Dataset] - # phone classes: ' + str(class_num) + ', number of utterances for ' + split + ': ' + str(len(usage_list)))

    max_len = 3000000
    # X is the final sample matrix: each row is one sample made of concat_nframes frames
    X = torch.empty(max_len, 39 * concat_nframes)
    if mode != 'test':
      y = torch.empty(max_len, dtype=torch.long)  # labels

    idx = 0
    for i, fname in tqdm(enumerate(usage_list)):
        feat = load_feat(os.path.join(feat_dir, mode, f'{fname}.pt'))  # load one .pt file as a tensor
        cur_len = len(feat)  # number of frames (rows) in this tensor; used below to slice into X
        feat = concat_feat(feat, concat_nframes)   # each row becomes the frame plus its neighbours (concat_nframes frames in total); at the edges the edge frame itself is repeated
        if mode != 'test':
          label = torch.LongTensor(label_dict[fname])  # labels for this utterance

        X[idx: idx + cur_len, :] = feat  # store into X
        if mode != 'test':
          y[idx: idx + cur_len] = label

        idx += cur_len
 
    X = X[:idx, :]  # X was allocated with 3,000,000 rows; drop the unused tail
    if mode != 'test':
      y = y[:idx]

    print(f'[INFO] {split} set')
    print(X.shape)
    if mode != 'test':
      print(y.shape)
      return X, y
    else:
      return X
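
The train/validation split inside preprocess_data depends on reseeding before every shuffle, so that the 'train' call and the 'val' call see the same order. A minimal sketch of that idea, with the hypothetical helper `split_utterances` and made-up utterance IDs:

```python
import random


def split_utterances(ids, split, train_ratio=0.8, seed=1337):
    """Return the train or val part of ids, using the same shuffle each call."""
    ids = list(ids)           # copy so the caller's list is not reordered
    random.seed(seed)         # same seed -> same permutation on every call
    random.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut] if split == 'train' else ids[cut:]


ids = [f'utt-{i}' for i in range(10)]   # made-up utterance IDs
train_ids = split_utterances(ids, 'train')
val_ids = split_utterances(ids, 'val')
print(len(train_ids), len(val_ids))
print(set(train_ids).isdisjoint(val_ids))
```

Note that the split is made per utterance, not per frame, so frames of one recording never end up in both sets.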

Testing the concat_feat function gives a rough idea of what the code above does.

x = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]])
res = concat_feat(x, 3)
res

tensor([[ 1,  2,  3,  1,  2,  3,  4,  5,  6],
        [ 1,  2,  3,  4,  5,  6,  7,  8,  9],
        [ 4,  5,  6,  7,  8,  9, 10, 11, 12],
        [ 7,  8,  9, 10, 11, 12, 13, 14, 15],
        [10, 11, 12, 13, 14, 15, 16, 17, 18],
        [13, 14, 15, 16, 17, 18, 16, 17, 18]])
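
The same windowing can also be written with clamped indices, which may be easier to read than the shift-based version. The sketch below uses plain Python lists instead of tensors; `concat_feat_lists` is a hypothetical name, not part of the homework code.

```python
def concat_feat_lists(frames, concat_n):
    """For each frame, concatenate it with its neighbours; edges repeat the edge frame."""
    assert concat_n % 2 == 1      # need a centre frame
    half = concat_n // 2
    last = len(frames) - 1
    out = []
    for i in range(len(frames)):
        row = []
        for offset in range(-half, half + 1):
            j = min(max(i + offset, 0), last)   # clamp the index to the valid range
            row.extend(frames[j])
        out.append(row)
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]]
res = concat_feat_lists(x, 3)
for row in res:
    print(row)
```

The first and last rows repeat the edge frame, matching the tensor output shown above.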

Loading the dataset

import gc

def loadData(concat_nframes, train_ratio, batch_size):
    # preprocess data
    train_X, train_y = preprocess_data(split='train', feat_dir='./libriphone/feat', phone_path='./libriphone', 
                                       concat_nframes=concat_nframes, train_ratio=train_ratio)
    val_X, val_y = preprocess_data(split='val', feat_dir='./libriphone/feat', phone_path='./libriphone', 
                                   concat_nframes=concat_nframes, train_ratio=train_ratio)

    # get dataset
    train_set = TensorDataset(train_X, train_y)
    val_set = TensorDataset(val_X, val_y)
    print('Training set size: {:d}, number of batches: {:.2f}'.format(len(train_set), len(train_set)/ batch_size))
    print('Validation set size: {:d}, number of batches: {:.2f}'.format(len(val_set), len(val_set)/ batch_size))
    # remove raw feature to save memory
    del train_X, train_y, val_X, val_y
    gc.collect()

    # get dataloader
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False, drop_last=True)
    
    return train_loader, val_loader

Define the model

The model is defined modularly, and the classifier lets you specify the number of hidden layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicBlock, self).__init__()

        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.ReLU(),
            nn.BatchNorm1d(output_dim),
            nn.Dropout(0.3)
        )

    def forward(self, x):
        x = self.block(x)
        return x


class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41, hidden_layers=1, hidden_dim=256):
        super(Classifier, self).__init__()

        self.fc = nn.Sequential(
            BasicBlock(input_dim, hidden_dim),
            *[BasicBlock(hidden_dim, hidden_dim) for _ in range(hidden_layers)],
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        x = self.fc(x)
        return x

Training function

The optimizer is AdamW, and CosineAnnealingWarmRestarts adjusts the lr. The d2l toolkit from the book "Dive into Deep Learning" plots the loss and accuracy during training.

def trainer(show_num, train_loader, val_loader, model, config, devices):  
    
    criterion = nn.CrossEntropyLoss() 
    optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 
                                        T_0=2, T_mult=2, eta_min=config['learning_rate']/50)
    n_epochs, best_acc, early_stop_count = config['num_epoch'], 0.0, 0
    num_batches = len(train_loader)
    
    if not os.path.isdir('./models'):
        os.mkdir('./models') # Create directory of saving models.
    legend = ['train loss', 'train acc']
    if val_loader is not None:
        legend.append('valid loss')  
        legend.append('valid acc')  
    animator = d2l.Animator(xlabel='epoch', xlim=[0, n_epochs], legend=legend)       
        
    for epoch in range(n_epochs):
        train_acc, train_loss = 0.0, 0.0
        count = 0
        
        # training
        model.train() # set the model to training mode
        for i, (data, labels) in enumerate(train_loader):
            data, labels = data.to(devices[0]), labels.to(devices[0])         

            optimizer.zero_grad() 
            outputs = model(data) 

            loss = criterion(outputs, labels)
            loss.backward() 
            optimizer.step() 

            _, train_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            train_acc += (train_pred.detach() == labels.detach()).sum().item()
            train_loss += loss.item()
            count += 1
        
            if (i + 1) % (num_batches // show_num) == 0:
                train_acc = train_acc / count / len(data)
                train_loss = train_loss / count
                print('train_acc {:.3f}'.format(train_acc))
                animator.add(epoch  + (i + 1) / num_batches, (train_loss, train_acc, None, None)) 
                train_acc, train_loss, count = 0.0, 0.0, 0                
                
        scheduler.step()
        # validation
        if val_loader is not None:
            model.eval() # set the model to evaluation mode
            val_acc, val_loss = 0.0, 0.0  
            with torch.no_grad():
                for i, (data, labels) in enumerate(val_loader):
                    data, labels = data.to(devices[0]), labels.to(devices[0])
                    outputs = model(data)

                    loss = criterion(outputs, labels) 

                    _, val_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
                    val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
                    val_loss += loss.item()                

                val_acc = val_acc / len(val_loader) / len(data)
                val_loss = val_loss / len(val_loader)
                print('val_acc {:.3f}'.format(val_acc))
                animator.add(epoch + 1, (None, None, val_loss, val_acc))
                
                # if the model improves, save a checkpoint at this epoch
                if val_acc > best_acc:
                    best_acc = val_acc
                    torch.save(model.state_dict(), config['model_path'])
                    # print('saving model with acc {:.3f}'.format(best_acc / len(val_loader) / len(labels)))


    # if not validating, save the last epoch
    if val_loader is None:
        torch.save(model.state_dict(), config['model_path'])
        # print('saving model at last epoch')
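
The trainer divides the number of correct predictions by len(val_loader) and by len(data), which is exact only when every batch has the same size (true here because the loaders use drop_last=True). A sketch of a size-independent alternative that counts samples directly is shown below; the function `accuracy` and the toy batches are hypothetical.

```python
def accuracy(batches):
    """Fraction of correct predictions over all samples, whatever the batch sizes."""
    correct, total = 0, 0
    for preds, labels in batches:
        correct += sum(int(p == y) for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total if total else 0.0


# Two toy batches of different sizes: 3 of 4 correct, then 1 of 2 correct.
batches = [([1, 2, 3, 4], [1, 2, 0, 4]), ([5, 6], [5, 0])]
exact = accuracy(batches)

# The trainer's formula, using the size of the last batch for every batch.
n_correct = 4
naive = n_correct / len(batches) / len(batches[-1][1])
print(exact, naive)
```

With unequal batches the two numbers differ, so counting samples is the safer choice if drop_last is ever switched off.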

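Load the dataset and train

Loading the dataset

concat_nframes is an important parameter. In class the teaching assistant said it could be set up to 11, but I found that a larger value works better.

Hung-yi Lee 2022 Machine Learning HW2 notes - Zhihu: concat_nframes should be tuned as far as possible, but it must not be too large; too large a value lowers the model's accuracy.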

concat_nframes = 17             # the number of frames to concat with, n must be odd (total 2k+1 = n frames)
train_ratio = 0.9               # the ratio of data used for training, the rest will be used for validation    
batch_size = 8192*4                # batch size
train_loader, val_loader = loadData(concat_nframes, train_ratio, batch_size)

Training

devices = d2l.try_all_gpus()
print(f'DEVICE: {devices}')

# fix random seed
seed = 0                        # random seed
same_seeds(seed)

config = {
    # training parameters
    'num_epoch': 3,                  # the number of training epoch
    'learning_rate': 1e-3,          # learning rate    

    # model parameters   
    'hidden_layers': 12,               # the number of hidden layers
    'hidden_dim': 256                  # the hidden dim     
}

config['model_path'] = ('./models/model' + str(config['learning_rate']) + '-' + str(config['hidden_layers'])
                        + '-' + str(config['hidden_dim']) + '.ckpt')  # the path where the checkpoint will be saved
input_dim = 39 * concat_nframes # the input dim of the model, you should not change the value
model = Classifier(input_dim=input_dim, hidden_layers=config['hidden_layers'], 
                   hidden_dim=config['hidden_dim'])
model = nn.DataParallel(model, device_ids = devices).to(devices[0])

trainer(5, train_loader, val_loader, model, config, devices)

Delete the data in memory to save space

del train_loader, val_loader
gc.collect()

Prediction

Prediction function

def pred(test_loader, model, devices):
    test_acc = 0.0
    test_lengths = 0
    preds = []

    model.eval()
    with torch.no_grad():
        for batch in tqdm(test_loader):
            features = batch[0].to(devices[0])
            outputs = model(features)
            _, test_pred = torch.max(outputs, 1) # get the index of the class with the highest probability
            preds.append(test_pred.cpu())
    preds = torch.cat(preds, dim=0).numpy()  # concatenate the per-batch predictions collected above
    return preds

Run the prediction

# load data
test_X = preprocess_data(split='test', feat_dir='./libriphone/feat', phone_path='./libriphone', concat_nframes=concat_nframes)
test_set = TensorDataset(test_X)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)


# load model
model = Classifier(input_dim=input_dim, hidden_layers=config['hidden_layers'], 
                   hidden_dim=config['hidden_dim'])
model = nn.DataParallel(model, device_ids = devices).to(devices[0])
model.load_state_dict(torch.load(config['model_path']))
predictions = pred(test_loader, model, devices)  # separate name so the function `pred` is not shadowed


# output
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(predictions):
        f.write('{},{}\n'.format(i, y))

Solution

With concat_nframes fixed at 17, I mainly tuned two parameters, hidden_layers and hidden_dim. Reaching the boss baseline requires a more complex model; see "Hung-yi Lee 2022 Machine Learning HW2 analysis_Machine Learning Craftsman's blog - CSDN Blog".

hidden_layers=7, hidden_dim=256

Training process:

Results:

This run was not tuned carefully, yet it already comes fairly close to the medium baseline of 0.69747.

hidden_layers=12, hidden_dim=512

I first tried hidden_layers=12, hidden_dim=256, but the improvement was limited, so I widened every layer to 512.

The training curves show clear waves, because CosineAnnealingWarmRestarts is used to adjust the lr.
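
To see where the waves come from, the sketch below reproduces the cosine-with-warm-restarts formula in plain Python for the settings used in the trainer (T_0=2, T_mult=2, eta_min = lr/50, one scheduler step per epoch). It is an illustration of the published formula, not PyTorch's own implementation, and `warm_restart_lrs` is a hypothetical name.

```python
import math


def warm_restart_lrs(base_lr, eta_min, t_0, t_mult, n_epochs):
    """Learning rate at the start of each epoch under cosine annealing with restarts."""
    lrs, t_cur, t_i = [], 0, t_0
    for _ in range(n_epochs):
        cos_term = 1 + math.cos(math.pi * t_cur / t_i)
        lrs.append(eta_min + 0.5 * (base_lr - eta_min) * cos_term)
        t_cur += 1
        if t_cur >= t_i:              # end of a cycle: restart and lengthen it
            t_cur, t_i = 0, t_i * t_mult
    return lrs


lrs = warm_restart_lrs(1e-3, 1e-3 / 50, t_0=2, t_mult=2, n_epochs=7)
for epoch, lr in enumerate(lrs):
    print(epoch, f'{lr:.6f}')
```

The rate jumps back to its maximum at epochs 2 and 6 (cycle lengths 2, 4, 8, ...), and each jump shows up as a bump in the loss and accuracy curves.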

The result is about 0.3% short of the strong baseline (0.75028); with further training it should be able to exceed the strong baseline.

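Discussion

Wider or deeper

When changing the model parameters there are two options: make the model deeper or make it wider.

Hung-yi Lee 2022 Machine Learning HW2 analysis_Machine Learning Craftsman's blog - CSDN Blog_Hung-yi Lee machine learning homework 2

With concat_nframes=19, hidden_layers=3, hidden_dim=1024, the model has 3,958,825 parameters and exceeds the strong baseline.

I tried concat_nframes=17, hidden_layers=12, hidden_dim=512. This model has 3,526,185 parameters, and its score is 0.7% lower than the model above. The number of layers went from 3 to 12, but because each layer is narrower, the total parameter count actually fell by about 11%.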
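
The two parameter counts can be checked by hand from the Classifier definition: each BasicBlock holds a Linear layer (weights plus bias) and a BatchNorm1d layer (a scale and a shift per feature), and the final Linear maps to 41 classes. A small sketch of that arithmetic, with the hypothetical helper `count_params`:

```python
def count_params(concat_nframes, hidden_layers, hidden_dim, feat_dim=39, n_classes=41):
    """Trainable parameters of the Classifier defined earlier in this post."""
    def block(n_in, n_out):
        linear = n_in * n_out + n_out      # weight matrix + bias
        batch_norm = 2 * n_out             # per-feature scale and shift
        return linear + batch_norm

    input_dim = feat_dim * concat_nframes
    total = block(input_dim, hidden_dim)                     # first block
    total += hidden_layers * block(hidden_dim, hidden_dim)   # hidden blocks
    total += hidden_dim * n_classes + n_classes              # output layer
    return total


wide = count_params(19, 3, 1024)
deep = count_params(17, 12, 512)
print(wide, deep, f'{(1 - deep / wide) * 100:.1f}% fewer')
```

The counts match the figures quoted above, and the deeper model comes out roughly 11% smaller.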

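Many articles say that a network should be made deeper rather than wider (see "Goodfellow Deep Learning notes -- Neural network architecture_iwill323's blog - CSDN Blog"). The comparison above offers some support for this: the deeper model has fewer parameters, yet its score is only slightly lower.

I then tried an even narrower and deeper model, hidden_layers=18, hidden_dim=256, but the result was poor.

The training error is relatively large, which shows that the model is not well optimized. This may be because a deeper model is harder to optimize.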

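Effect of the learning rate

As a beginner, I find it very important to probe the learning rate first. Running a few epochs is enough to tell which learning rate works. As shown below, 1e-3 to 1e-2 is the suitable range.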

1e-1
1e-2
1e-3
1e-4
1e-5

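Effect of the batch size

The teacher explained this very clearly in class, so I simply paste the slide here.

You can also refer to:

Goodfellow "flower book" study notes -- Optimization for training deep models - Programmer Sought

Below I use concat_nframes=5, hidden_layers=1, hidden_dim=64 and run 3 epochs on the CPU.

Batch size	Time taken
8	2553
64	449
128	272
256	187

When the batch size grows n-fold, the time taken shrinks by less than a factor of n but by more than a factor of n/2.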
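
The claim can be checked directly against the timings in the table. The sketch below computes, for each pair of neighbouring rows, how much the batch size grew and how much faster the run became; the numbers are copied from the table and the variable names are illustrative.

```python
# Batch size -> time taken, copied from the table above.
times = {8: 2553, 64: 449, 128: 272, 256: 187}

sizes = sorted(times)
ratios = []
for small, large in zip(sizes, sizes[1:]):
    n = large / small                        # growth of the batch size
    speedup = times[small] / times[large]    # reduction of the running time
    ratios.append((n, speedup))
    print(f'{small} -> {large}: batch x{n:.0f}, speedup x{speedup:.2f}')
```

Every speedup lands between n/2 and n, so larger batches help, but with diminishing returns.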

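To speed up the computation I use batch_size = 8192 * 4, which counts as very large. At one point I set the batch size even larger and found that len(val_dataloader)=0. At first I did not understand why; then I realized that batch_size was larger than 264570, so the only (incomplete) batch was discarded by drop_last=True. Finding a suitable batch size matters a lot when training.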
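
The empty validation loader follows from how the number of batches is computed: with drop_last=True the incomplete final batch is discarded, so the count is a floor division. A minimal sketch, assuming 264570 is the number of validation samples (the post does not say so explicitly) and using the hypothetical helper `num_batches`:

```python
import math


def num_batches(n_samples, batch_size, drop_last):
    """Number of batches a loader yields for the given dataset size."""
    if drop_last:
        return n_samples // batch_size          # incomplete last batch is discarded
    return math.ceil(n_samples / batch_size)    # incomplete last batch is kept


n_val = 264570   # assumed to be the validation set size
print(num_batches(n_val, 300000, drop_last=True))     # no batches at all
print(num_batches(n_val, 300000, drop_last=False))    # one partial batch
print(num_batches(n_val, 8192 * 4, drop_last=True))   # full batches only
```

Setting drop_last=False for the validation loader would avoid the empty loader, at the cost of a smaller final batch.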

Hung-yi Lee Machine Learning Homework 2 - Phoneme classification
