/*
Prompt:
I need to classify audio clips. Given a clip as input, the output should be one of: 1. Laughter, 2. Applause, 3. Unknown. Please write the program in TensorFlow 2.0, using a convolutional neural network on MFCC features with softmax as the output function. I have exactly two audio datasets: one of laughter and one of applause. Please provide the program and the dataset directory structure.
Give complete code, including the following parts:
1. Load and preprocess data
2. Build and compile the model
3. Train the model and save the model file
4. Load the model file
5. Use the model for prediction
*/
Sure. Here is a Python program written with TensorFlow 2.0 that classifies audio using a convolutional neural network on MFCC features. The program is as follows:
import math
import os

import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
# Class labels
classes = ['laugh', 'applause', 'unknown']
# MFCC parameters
num_mfcc = 20
n_fft = 2048
hop_length = 512
num_segments = 5
# Length in seconds of the audio window that is split into num_segments segments
segment_duration = 1
# Sample rate in Hz
sample_rate = 44100
# Dataset directories
train_dir = './data/train'
val_dir = './data/val'
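# Load and preprocess data
def load_data(data_dir):
    # Feature and label lists
    data = []
    labels = []
    # Each segment covers segment_duration / num_segments seconds of audio
    samples_per_segment = int(segment_duration * sample_rate / num_segments)
    num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)
    # Process the folder of each class
    for i, cls in enumerate(classes):
        cls_dir = os.path.join(data_dir, cls)
        # There is no data for 'unknown', so skip classes without a folder
        if not os.path.isdir(cls_dir):
            continue
        for audio_file in os.listdir(cls_dir):
            file_path = os.path.join(cls_dir, audio_file)
            # Load the audio file, resampled to sample_rate
            signal, sr = librosa.load(file_path, sr=sample_rate)
            # Split the audio into segments and extract MFCCs from each one
            for j in range(num_segments):
                start_sample = samples_per_segment * j
                end_sample = start_sample + samples_per_segment
                segment = signal[start_sample:end_sample]
                # Stop when the clip is too short to fill another segment
                if len(segment) < samples_per_segment:
                    break
                mfcc = librosa.feature.mfcc(y=segment, sr=sr,
                                            n_fft=n_fft, n_mfcc=num_mfcc,
                                            hop_length=hop_length)
                mfcc = mfcc.T
                # Keep only segments with the expected number of frames
                if len(mfcc) == num_mfcc_vectors_per_segment:
                    data.append(mfcc)
                    labels.append(i)
    # One-hot encode the labels
    labels = to_categorical(labels, num_classes=len(classes))
    return np.array(data), np.array(labels)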
# Load and preprocess the training data
train_data, train_labels = load_data(train_dir)
# Load and preprocess the validation data
val_data, val_labels = load_data(val_dir)
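# Build and compile the model
model = Sequential()
# Add a channel axis: Conv2D expects inputs shaped (time, mfcc, channels)
model.add(tf.keras.layers.Reshape(train_data.shape[1:] + (1,), input_shape=train_data.shape[1:]))
# padding='same' keeps the small MFCC feature map from shrinking to nothing
# before the fourth pooling layer
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
# Softmax turns the final layer into class probabilities
model.add(Dense(len(classes), activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])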
# Train the model and save the model file
checkpoint = ModelCheckpoint('model.h5', monitor='val_accuracy', save_best_only=True, verbose=1)
history = model.fit(train_data, train_labels, epochs=50, batch_size=16,
                    validation_data=(val_data, val_labels),
                    callbacks=[checkpoint])
# Load the model file
model = tf.keras.models.load_model('model.h5')
# Use the model for prediction
def predict(filename):
    signal, sr = librosa.load(filename, sr=sample_rate)
    # Split the audio into segments, the same way load_data does
    samples_per_segment = int(segment_duration * sample_rate / num_segments)
    num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)
    segments = []
    for j in range(num_segments):
        start_sample = samples_per_segment * j
        end_sample = start_sample + samples_per_segment
        segment = signal[start_sample:end_sample]
        # Stop when the clip is too short to fill another segment
        if len(segment) < samples_per_segment:
            break
        mfcc = librosa.feature.mfcc(y=segment, sr=sr,
                                    n_fft=n_fft, n_mfcc=num_mfcc,
                                    hop_length=hop_length)
        mfcc = mfcc.T
        if len(mfcc) == num_mfcc_vectors_per_segment:
            segments.append(mfcc)
    # A clip that yields no full segment cannot be classified
    if not segments:
        return 'unknown'
    # Convert to a NumPy array
    segments = np.array(segments)
    # Predict the class of each segment
    predictions = model.predict(segments)
    # Average the per-segment probabilities and pick the most likely class
    predicted_class_index = int(np.argmax(predictions.mean(axis=0)))
    return classes[predicted_class_index]
The above code includes the following parts:
- Load and preprocess data: the load_data function loads an audio dataset and preprocesses it by splitting each clip into segments and extracting MFCC features from every segment.
- Model definition and compilation: this section defines the convolutional neural network, whose output layer uses softmax to produce class probabilities, and compiles it with the Adam optimizer and categorical cross-entropy loss.
- Model training and saving: this section trains the model and, at the end of each epoch, saves the model file whenever the validation accuracy improves, so model.h5 always holds the best model so far.
- Model loading: loads the best model file that was saved.
- predict function: accepts the path of an audio file and uses the loaded model to predict its category.
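Because the dataset contains only laughter and applause, the 'unknown' output never sees a training example, so the network has nothing from which to learn it. One common workaround, offered here as a sketch and not as part of the program above, is to report 'unknown' whenever the highest softmax probability falls below a threshold. The helper name label_with_threshold and the threshold of 0.8 are assumptions chosen for illustration; the threshold would have to be tuned on real recordings.

```python
import numpy as np

# Hypothetical helper (not part of the program above): maps softmax
# probabilities to a label, falling back to 'unknown' when the model
# is not confident enough.
def label_with_threshold(probabilities, class_names=('laugh', 'applause'),
                         threshold=0.8, unknown_label='unknown'):
    probabilities = np.asarray(probabilities, dtype=float)
    # Consider only the classes that actually had training data
    known = probabilities[:len(class_names)]
    best = int(np.argmax(known))
    if known[best] < threshold:
        return unknown_label
    return class_names[best]

print(label_with_threshold([0.95, 0.04, 0.01]))  # laugh
print(label_with_threshold([0.55, 0.44, 0.01]))  # unknown
```

The probabilities returned by model.predict could be passed to a helper like this instead of taking the argmax directly.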
The directory structure of the dataset should look like this:
- data
  - train
    - laugh
      - audio1.wav
      - audio2.wav
      - ...
    - applause
      - audio1.wav
      - audio2.wav
      - ...
  - val
    - laugh
      - audio1.wav
      - audio2.wav
      - ...
    - applause
      - audio1.wav
      - audio2.wav
      - ...
where laugh and applause are the two audio categories, and train and val are the training set and validation set, respectively. Each category folder contains multiple audio files.
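Before training, it can help to confirm that the folders on disk match this layout. The following is a minimal sketch using only the standard library; the function name summarize_dataset is hypothetical, and it assumes the audio files use the .wav extension as in the listing above.

```python
import os

# Hypothetical helper: counts .wav files in each split/class folder.
def summarize_dataset(root, splits=('train', 'val'),
                      class_names=('laugh', 'applause')):
    counts = {}
    for split in splits:
        for cls in class_names:
            cls_dir = os.path.join(root, split, cls)
            if not os.path.isdir(cls_dir):
                # Record missing folders as None so they are easy to spot
                counts[(split, cls)] = None
                continue
            counts[(split, cls)] = sum(
                1 for name in os.listdir(cls_dir)
                if name.lower().endswith('.wav')
            )
    return counts

if __name__ == '__main__':
    for (split, cls), n in summarize_dataset('./data').items():
        status = 'missing' if n is None else f'{n} files'
        print(f'{split}/{cls}: {status}')
```

A folder reported as missing or holding 0 files is worth fixing first, since a class with no files contributes no training examples.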
The working code, an explanation of it, and usage hints were all produced in a few seconds. What do you think after reading the exchange above? And it can do far more than this!