Summary and Sharing of the 2nd "Malanshan" Cup International Audio and Video Algorithm Competition, Music Beat Detection Track (Rank 7)

  I participated in the music beat detection track of the second "Malanshan" Cup and finished in 7th place. After a competition it is worth writing a careful summary; only by sorting through what happened can the experience turn into useful knowledge.
  I entered this competition in a hurry, joining only halfway through the schedule, so I had about three weeks in total. Because time was tight, there were many problems in music computing that I never really got to understand. Fortunately, I found two recent papers whose open-source code fit this task very well, and I used it almost directly after some modification. I did not expect to squeeze into the top ranks at the end; I was lucky.
  Competition link: https://challenge.ai.mgtv.com/contest/detail/10
  The two papers mentioned:
  [1] Drum-Aware Ensemble Architecture for Improved Joint Musical Beat and Downbeat Tracking
  [2] Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking

1. Introduction to the competition

  The regular alternation of strong and weak accents in music is called the beat; we usually describe it as the pulse of a piece of music. Beat detection is fundamental work in computational musicology, and many new applications can be built on top of it. This task requires detecting the starting time positions of the beats and downbeats in a given piece of music.
  The first dataset provided by the competition is the publicly available GTZAN dataset: http://marsyas.info/downloads/datasets.html. It contains 1000 music clips, each about 30 s long. The second is a dataset made by Mango TV, with 100 songs in the training set, 50 songs in test set A, and 50 songs in test set B. Both training sets are annotated with beat and downbeat time points. The competition does not allow the use of other external datasets or pretrained models.

2. Algorithm introduction

  At present, music beat detection is mainly done with neural networks. Reportedly some teams in this competition used classical machine-learning methods, but there was still some gap between their results and the front-runners. The code I used is modified from the open-source code of paper [2] and is also a neural network. The algorithm can be understood through the following steps:
(1) Network input. The input to this algorithm is not the raw audio but the spectrogram of the audio together with its first-order difference, so the audio must be preprocessed first. The madmom package is used for this; the method follows the original open-source code. The resulting feature is a two-dimensional array: the horizontal axis is time at 0.01 s intervals (a 30 s clip yields exactly 3000 frames), and the vertical axis holds the spectral features, 322 dimensions in total in my setup. As mentioned in step (5) below, the network uses the idea of source separation to split the original sound into a drum component and a non-drum component, so those two components must also be processed by madmom into spectral features; these two feature inputs are used to compute a loss against the outputs of the network's OU sub-networks in order to train them. In total the network therefore has three inputs: the features of the original sound, the features of the drum component, and the features of the non-drum component. A minimal feature-extraction sketch follows.
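For illustration, a minimal single-resolution version of this kind of madmom pipeline might look like the sketch below. The exact settings (frame size, filterbank bands) are my assumptions for a sketch; the real code stacks several frame sizes to reach the 322 dimensions.

```python
import numpy as np
from madmom.processors import SequentialProcessor
from madmom.audio.signal import SignalProcessor, FramedSignalProcessor
from madmom.audio.stft import ShortTimeFourierTransformProcessor
from madmom.audio.spectrogram import (FilteredSpectrogramProcessor,
                                      LogarithmicSpectrogramProcessor,
                                      SpectrogramDifferenceProcessor)

# fps=100 -> one feature frame every 0.01 s, so a 30 s clip gives ~3000 frames
sig = SignalProcessor(num_channels=1, sample_rate=44100)
frames = FramedSignalProcessor(frame_size=2048, fps=100)
stft = ShortTimeFourierTransformProcessor()
filt = FilteredSpectrogramProcessor(num_bands=12, fmin=30, fmax=17000)
log = LogarithmicSpectrogramProcessor(mul=1, add=1)
# append the positive first-order difference to the log spectrogram
diff = SpectrogramDifferenceProcessor(diff_ratio=0.5, positive_diffs=True,
                                      stack_diffs=np.hstack)

extract = SequentialProcessor([sig, frames, stft, filt, log, diff])
feats = extract('song.wav')   # 2-D array: (time frames, spectral features)
```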
(2) Labels. Beats and downbeats are embedded into a one-dimensional array with 0.01 s steps, so 30 s of music likewise corresponds to a 3000-element label array. Frames containing a beat are labeled 1, frames containing a downbeat are labeled 2, and all other frames are 0. Zeros naturally dominate, because beats and downbeats are single points and therefore sparse; the ratio of 0:1:2 is usually around 200:3:1, so during training the cross-entropy loss is weighted 1:67:200. A sketch follows.
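For concreteness, here is a sketch of how such a label array and the weighted loss could be built; the function and variable names are my own, not the competition code.

```python
import numpy as np
import torch
import torch.nn as nn

FPS = 100  # 0.01 s per frame

def make_labels(beat_times, downbeat_times, num_frames):
    """Quantize annotated times onto the frame grid: 0 = none, 1 = beat, 2 = downbeat."""
    labels = np.zeros(num_frames, dtype=np.int64)
    beat_idx = np.round(np.asarray(beat_times) * FPS).astype(int)
    down_idx = np.round(np.asarray(downbeat_times) * FPS).astype(int)
    labels[beat_idx[beat_idx < num_frames]] = 1
    labels[down_idx[down_idx < num_frames]] = 2  # downbeats override plain beats
    return labels

# 0/1/2 occur at roughly 200:3:1, hence the inverse weighting of about 1:67:200
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 67.0, 200.0]))
```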
(3) Output. The network outputs a three-column time series of the same length as the label array, representing no-beat, beat, and downbeat respectively; after sigmoid activation the values lie in the range 0 to 1.
(4) Post-processing. The activation values output by the network cannot directly determine where the beats and downbeats are; they need to be post-processed by an HMM module. I used the HMM that comes with the madmom package directly rather than writing my own; a usage sketch follows.
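The basic usage is roughly as follows. madmom's DBNDownBeatTrackingProcessor expects a two-column (beat, downbeat) activation array, so I assume here that the no-beat column of the network output is dropped first; the random array is just a stand-in for real activations.

```python
import numpy as np
from madmom.features.downbeats import DBNDownBeatTrackingProcessor

activations = np.random.rand(3000, 3)  # stand-in for the network output
act = activations[:, 1:]               # keep only the beat and downbeat columns

hmm = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
events = hmm(act)                          # rows of (time in s, position in bar)
beats = events[:, 0]
downbeats = events[events[:, 1] == 1, 0]   # bar position 1 marks a downbeat
```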
(5) Network structure. Here is the figure from the paper:
[Figure 1: network architecture diagram from the paper]
  Beats are easy to detect in music with drums but much harder in music without drums, so the paper's method first performs source separation, splitting the original music into a drum part and a non-drum part; this teaches the network how to detect beats both with and without drums. The tool used for the separation is Spleeter, a very effective open-source music source separation project with pretrained weights available for 2-stem, 4-stem, and 5-stem modes. The algorithm uses Spleeter's 4-stem mode to separate the music into drums, bass, other, and vocals, then mixes bass, other, and vocals back into a single track called nodrum, leaving two tracks: drum and nodrum. Since running Spleeter and then extracting features is cumbersome and especially time-consuming at inference, the paper also adds a source-separation sub-network that learns Spleeter's separation results and directly generates the spectral features of the drum and non-drum components; this sub-network essentially transcribes the Spleeter function, fusing the Spleeter and madmom steps into one. A sketch of the offline Spleeter step follows.
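A sketch of the offline Spleeter step, using Spleeter's standard Python API (the file name is a placeholder):

```python
from spleeter.separator import Separator
from spleeter.audio.adapter import AudioAdapter

separator = Separator('spleeter:4stems')   # drums / bass / other / vocals
loader = AudioAdapter.default()
waveform, rate = loader.load('song.wav', sample_rate=44100)

stems = separator.separate(waveform)       # dict: stem name -> waveform
drum = stems['drums']
# remix everything that is not drums into a single "nodrum" track
nodrum = stems['bass'] + stems['other'] + stems['vocals']
```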
  Features are then extracted from the original music (the mixture in the figure) and from drum and nodrum, and fed respectively into three beat detectors: the Drum Beat tracker, the Mixture Beat tracker, and the NoDrum Beat tracker. Each detector is simply an LSTM plus a fully connected layer, without any more complex structure. The outputs of the three detectors, together with the intermediate features just before their fully connected layers, are concatenated and fed as input to a fourth detector, the Fuser Beat tracker, whose output goes to the HMM. During training, the outputs of all four detectors are compared against the labels simultaneously and optimized together under label supervision, but only the output of the Fuser detector, which performs best, is used at inference.
  To sum up, the network first contains two source-separation sub-networks (OU). The two share the same structure but not parameters: one, DrumOU, separates the input features into features x_drum containing only the drum component, and the other, BeatOU, separates them into features x_nodrum containing no drum component. The network then contains four joint beat/downbeat detection sub-networks, again identical in structure but with separate parameters: the first three process drum, mixture, and nodrum respectively, and the last fuses the results of the first three. A minimal PyTorch sketch of the ensemble follows.
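To make the structure concrete, here is a minimal PyTorch sketch of the ensemble idea. The layer sizes and the use of plain LSTMs for the OU sub-networks are my guesses for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BeatTracker(nn.Module):
    """One tracker: BiLSTM followed by a fully connected layer over 3 classes."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 3)   # no-beat / beat / downbeat

    def forward(self, x):
        h, _ = self.lstm(x)        # intermediate features before the FC layer
        return self.fc(h), h

class DrumAwareEnsemble(nn.Module):
    def __init__(self, in_dim=322, hidden=128):
        super().__init__()
        self.drum_ou = nn.LSTM(in_dim, in_dim, batch_first=True)    # stand-in for DrumOU
        self.nodrum_ou = nn.LSTM(in_dim, in_dim, batch_first=True)  # stand-in for BeatOU
        self.drum_tracker = BeatTracker(in_dim, hidden)
        self.mix_tracker = BeatTracker(in_dim, hidden)
        self.nodrum_tracker = BeatTracker(in_dim, hidden)
        # the fuser sees three trackers' logits plus their hidden features
        self.fuser = BeatTracker(3 * (3 + 2 * hidden), hidden)

    def forward(self, x_mix):
        x_drum, _ = self.drum_ou(x_mix)      # trained to match the Spleeter drum features
        x_nodrum, _ = self.nodrum_ou(x_mix)  # trained to match the nodrum features
        out_d, h_d = self.drum_tracker(x_drum)
        out_m, h_m = self.mix_tracker(x_mix)
        out_n, h_n = self.nodrum_tracker(x_nodrum)
        fused = torch.cat([out_d, h_d, out_m, h_m, out_n, h_n], dim=-1)
        out_f, _ = self.fuser(fused)
        # all four outputs are supervised during training; out_f is used at inference
        return out_f, (out_d, out_m, out_n), (x_drum, x_nodrum)
```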

3. Training method

1 Data augmentation
  The main augmentation is adjusting the level of the drum part of the sound: three variants were added, with the drums attenuated by 5 dB, attenuated by 10 dB, and removed entirely. I also tried a variety of other augmentation methods, but nothing else helped. A remixing sketch follows.
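A dB attenuation is just a scalar gain on the separated drum stem before remixing; here is a minimal sketch (the stems are random stand-ins for the Spleeter outputs from Section 2):

```python
import numpy as np

def remix_with_drum_gain(drum, nodrum, atten_db):
    """Attenuate the drum stem by atten_db dB, then remix with the rest."""
    gain = 10.0 ** (-atten_db / 20.0)   # 5 dB -> x0.56, 10 dB -> x0.32
    return gain * drum + nodrum

# stand-ins for stem waveforms from the Spleeter step sketched earlier
drum = np.random.randn(44100 * 30, 2)
nodrum = np.random.randn(44100 * 30, 2)

aug_5db = remix_with_drum_gain(drum, nodrum, 5.0)    # drums 5 dB quieter
aug_10db = remix_with_drum_gain(drum, nodrum, 10.0)  # drums 10 dB quieter
aug_nodrum = nodrum                                  # drums removed entirely
```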
2 Training steps
  Step 1: first freeze the parameters of the four trackers in the network and train only the two OU sub-networks;
  Step 2: then freeze the two OU sub-networks and train only the four tracker networks, training on the GTZAN dataset and validating on the Mango TV dataset;
  Step 3: after Step 2 has fully converged, add the Mango TV training set and train for a few more steps. (There is no independent validation set at this point, so it is impossible to accurately judge whether training is overfitting; you cannot train too long.) A parameter-freezing sketch follows.
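In PyTorch this staged schedule amounts to toggling requires_grad; a sketch, reusing the hypothetical DrumAwareEnsemble module from the sketch in Section 2:

```python
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = DrumAwareEnsemble()  # hypothetical module from the earlier sketch

# Step 1: freeze the four trackers, train only the two OU sub-networks
for m in (model.drum_tracker, model.mix_tracker, model.nodrum_tracker, model.fuser):
    set_trainable(m, False)
for m in (model.drum_ou, model.nodrum_ou):
    set_trainable(m, True)

# Step 2: the reverse -- freeze the OUs, train the four trackers
for m in (model.drum_ou, model.nodrum_ou):
    set_trainable(m, False)
for m in (model.drum_tracker, model.mix_tracker, model.nodrum_tracker, model.fuser):
    set_trainable(m, True)

# only pass the currently trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                             lr=1e-3)
```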
3 HMM tuning
  The HMM stage is executed with madmom.features.downbeats.DBNDownBeatTrackingProcessor from the madmom package. This processor has several parameters, and they must be tuned; after fine-tuning them, my score improved by 3 to 4 points. A grid-search sketch follows.
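The kind of small grid search I mean runs over a few of the processor's documented parameters (min_bpm, max_bpm, transition_lambda, observation_lambda); the value ranges are illustrative, and score_clip / validation_set are placeholders for the competition metric and a held-out set:

```python
import itertools
import numpy as np
from madmom.features.downbeats import DBNDownBeatTrackingProcessor

best_params, best_score = None, -1.0
for min_bpm, max_bpm, t_lambda, o_lambda in itertools.product(
        [50, 55, 60], [190, 215], [50, 100, 150], [8, 16, 32]):
    hmm = DBNDownBeatTrackingProcessor(
        beats_per_bar=[3, 4], fps=100, min_bpm=min_bpm, max_bpm=max_bpm,
        transition_lambda=t_lambda, observation_lambda=o_lambda)
    # score_clip / validation_set: placeholders for your metric and data
    score = np.mean([score_clip(hmm(act), annotation)
                     for act, annotation in validation_set])
    if score > best_score:
        best_params, best_score = (min_bpm, max_bpm, t_lambda, o_lambda), score
print(best_params, best_score)
```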

4. Effect analysis

I don't know how to embed an audio file on CSDN, so a picture will have to do:
[Figure 2]

  Figure 2 shows 30 seconds of music, one point every 0.01 s, 3000 points in total. A single line would be too crowded to read, so it is drawn as three rows from top to bottom, each covering 10 s. The blue and yellow curves are the activation values output by the neural network (i.e., the data fed to the HMM): blue is the beat dimension, yellow the downbeat dimension. The lower row of black dots are the annotated beats and the upper row the annotated downbeats; the red dots are the beats computed by my algorithm and the green dots the computed downbeats. The red and green numbers next to the points give the time error, in ms, between each computed point and its annotated counterpart. The 7 values after "beat" and "downbeat" in the title are fmeasure, pscore, cemgil, cmlc, cmlt, amlc, and amlt, the seven commonly used beat evaluation metrics, which are also the seven scoring metrics of this competition. bs is the average beat score, dbs the average downbeat score, and total the mean of the two (a sketch of computing these metrics follows).
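These seven metrics can be computed with the mir_eval package; this is my own evaluation sketch (the beat arrays are placeholders), not necessarily the organizers' scoring script:

```python
import numpy as np
import mir_eval

def seven_scores(reference, estimated):
    """fmeasure, pscore, cemgil, cmlc, cmlt, amlc, amlt for one clip."""
    reference = mir_eval.beat.trim_beats(reference)  # standard 5 s lead-in trim
    estimated = mir_eval.beat.trim_beats(estimated)
    f = mir_eval.beat.f_measure(reference, estimated)
    p = mir_eval.beat.p_score(reference, estimated)
    cemgil, _ = mir_eval.beat.cemgil(reference, estimated)
    cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(reference, estimated)
    return np.array([f, p, cemgil, cmlc, cmlt, amlc, amlt])

# placeholder arrays of annotated / detected event times in seconds
bs = seven_scores(annotated_beats, detected_beats).mean()
dbs = seven_scores(annotated_downbeats, detected_downbeats).mean()
total = (bs + dbs) / 2
```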
  The detection for the piece of music in Figure 2 is quite good: both beats and downbeats are located very accurately, and most songs are detected similarly well. Below are some of the more common failure cases that lose points.
[Figure 3]

[Figure 4]

[Figure 5]

  In Figure 3 the beats are all found correctly, but the downbeat positions are wrong; this case loses the most points. In Figure 4 beats are missed, while in Figure 5 twice as many beats are detected as annotated; both of these cases also lose quite a few points.

5. Summary and thinking

  My biggest gain from this competition is getting to know the new field of computational musicology and picking up some music and audio processing tools. I also realized the huge gap between myself and the leaders: I later learned that the champion team used a TCN network and a GRU+MLP network, and I had never even heard of TCN before. I can quickly port an open-source project and modify it, but asking me to design a network myself still does not work. This shows that my grasp of deep learning is still basic rather than deep: I am not proficient in the performance characteristics and usage of the various network structures, and I do not yet have the ability to create a network of my own. Future study needs to keep strengthening these fundamentals.
  After the competition I had some further thoughts; I think the algorithm could be improved in the following aspects:
  (1) Could the network take raw audio directly as input, instead of spectral features computed by libraries such as madmom? The network should be able to learn the feature-generation step directly from the raw data, making the pipeline end-to-end and faster.
  (2) Beat detection should not need to process the entire 30 s, because beats repeat: the beat positions should be determinable after hearing one or two bars, which could greatly increase speed.
  (3) The madmom HMM used in post-processing never felt quite right to me, because it is not trained on this dataset and may not be optimal for it. A trainable network structure should be designed instead; that would not only improve accuracy but could also be combined with the previous two steps to achieve true end-to-end processing at real-time speed.
