Kaggle Competition Summary: BirdCLEF 2023

  • Title of the competition: BirdCLEF 2023
  • Competition task: Identify bird calls in a soundscape
  • Question type: Audio classification (bird call recognition)
https://www.kaggle.com/competitions/birdclef-2023

1. Competition background

Birds are considered good indicators of biodiversity change because of their high mobility and diverse habitat needs. The success of a restoration project can be judged by observing changes in the species mix and abundance of birds. However, traditional observer-based bird biodiversity surveys require covering large areas, which is expensive and logistically challenging.

In contrast, passive acoustic monitoring (PAM), combined with new analytical tools such as machine learning, allows conservationists to sample at wider spatial scales with higher temporal resolution and to delve deeper into the relationship between restoration interventions and biodiversity. Against this backdrop, the competition asked entrants to use their machine learning skills to identify East African bird species by sound.

2. Competition tasks

The task is to identify East African bird species from continuous audio data using machine learning and sound recognition techniques. Entrants needed to develop computational solutions that build reliable classifiers capable of accurately identifying bird species from their calls.

3. Evaluation method

The evaluation metric for this competition is padded cmAP, a variant of the macro-averaged average precision (AP) score.

To support predictions for species with no true-positive labels, and to reduce the impact of species with very few true-positive labels on the score, five rows of all-positive padding are appended to both the submission and the solution before the score is calculated.
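
A minimal sketch of how this metric can be computed, assuming one-hot solution and probability submission DataFrames with identical species columns (the function name and structure follow the commonly shared community implementation, not an official reference):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

def padded_cmap(solution: pd.DataFrame, submission: pd.DataFrame,
                padding_factor: int = 5) -> float:
    # Append `padding_factor` rows of all-ones to both frames so that
    # species with no (or very few) true positives still get a defined,
    # dampened average-precision contribution.
    pad = pd.DataFrame(np.ones((padding_factor, len(solution.columns))),
                       columns=solution.columns)
    padded_solution = pd.concat([solution, pad], ignore_index=True)
    padded_submission = pd.concat([submission, pad], ignore_index=True)
    return average_precision_score(padded_solution.values,
                                   padded_submission.values,
                                   average="macro")
```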

4. Winning solutions

4.1 First place

https://www.kaggle.com/competitions/birdclef-2023/discussion/412808

Data preparation:

  • Competitors used a variety of data sources, including competition training data from 2023, 2022, 2021, and 2020, the additional competition data from 2020, Xeno-Canto data, and Zenodo data.
  • They observed a bug in the Xeno-Canto API that capped the number of files per species at 500, and hypothesized that this bug affected the data loading process by limiting the representation of certain species.
  • They performed manual and rule-based deduplication of the training data, ensuring that each species had representative samples in the training and validation sets.

Training:

  • Participants trained multiple models using different backbone networks, including eca_nfnet_l0, convnext_small_fb_in22k_ft_in1k_384 and convnextv2_tiny_fcmae_ft_in22k_in1k_384.
  • They adopted a training scheme that first pre-trained on data from previous competitions plus Xeno-Canto data, and then fine-tuned on the current competition's data.
  • Training used the focal loss, the Adam optimizer, and cosine-annealing learning-rate scheduling; class sampling weights were employed to address class imbalance (see the sketch after this list).
  • Various data augmentation methods were used during training, including mixup, background-noise augmentation with bird-free Zenodo recordings, random filtering (a custom augmentation), and spectrogram augmentation with frequency and time masks.
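
As a rough illustration of that recipe, here is a hedged PyTorch sketch of a multi-label focal loss with Adam and cosine annealing; the gamma/alpha values, learning rate, and the stand-in linear model are assumptions rather than the team's published settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalBCELoss(nn.Module):
    """Multi-label focal loss on logits; gamma/alpha are assumed values,
    not the team's published hyperparameters."""
    def __init__(self, gamma: float = 2.0, alpha: float = 0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)  # prob of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

# Stand-in model; the write-up used timm backbones such as eca_nfnet_l0.
model = nn.Linear(128, 264)  # 264 = number of BirdCLEF 2023 species
criterion = FocalBCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)
```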

Validation and ensembling:

  • A stratified 5-fold cross-validation approach was used; for each sample, the maximum probability across its 5-second segments was taken as the prediction.
  • Competitors emphasized the importance of computing padded cmAP (the competition evaluation metric) as the average over folds rather than on out-of-fold predictions.
  • The final ensemble averaged three SED models with different backbone networks (a minimal sketch of the segment-max step follows).
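
A minimal sketch of the segment-max step, assuming a model that maps a batch of 5-second segments to per-class logits (names and shapes are illustrative):

```python
import torch

@torch.no_grad()
def clipwise_prediction(model: torch.nn.Module,
                        segments: torch.Tensor) -> torch.Tensor:
    """segments: (n_segments, ...) — all 5-second chunks of one recording.
    Returns the per-class maximum probability across segments."""
    probs = torch.sigmoid(model(segments))  # (n_segments, n_classes)
    return probs.max(dim=0).values          # (n_classes,)
```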

4.2 Second place

https://www.kaggle.com/competitions/birdclef-2023/discussion/412707

Training data:

  • Competitors used data from the 2023, 2022, 2021, and 2020 competitions as training data, and included additional training data from xeno-canto.
  • They did not use recordings from the eBird website because, upon checking, that data is not public and could not be used.

Model architecture:

  • Participants used SED (Sound Event Detection) architecture and CNN architecture for model training.
  • The backbones of the SED models include tf_efficientnetv2_s_in21k, seresnext26t_32x4d, and tf_efficientnet_b3_ns.
  • The backbones of the CNN models include tf_efficientnetv2_s_in21k, resnet34d, tf_efficientnet_b3_ns, and tf_efficientnet_b0_ns.

Pseudo-labels and manual labels:

  • Competitors used the SED model to generate pseudo-labels and used quantile thresholding to extract potential nocall samples.
  • They then manually labeled potential bird-free samples by listening to the recordings. They manually tagged about 1,800 recordings but saw no improvement, probably because the pseudo-labels contain more false positives (FP) than false negatives (FN).

Model training:

  • Various data augmentation methods were used during training, including Gaussian noise, pink noise, gain, noise injection, background noise (no-birdcall samples from the 2020 and 2021 competitions, plus bird-free samples from the rainforest, ambient sounds, freefield1010, warblrb, and birdvox datasets), pitch shift, time shift, frequency masking, time masking, and Mixup applied to either the waveform or the spectrogram (see the Mixup sketch after this list).
  • To cope with dataset imbalance, weights calculated from primary and secondary labels were used.
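
A hedged sketch of Mixup as it is commonly applied to multi-label audio batches; the alpha value and the label-union rule are assumptions, since the write-up does not give exact details:

```python
import torch

def mixup(batch: torch.Tensor, targets: torch.Tensor, alpha: float = 0.5):
    """Works on raw waveforms (B, samples) or spectrograms (B, C, F, T).
    Targets are multi-hot; mixed targets take the union of both labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(batch.size(0))
    mixed = lam * batch + (1.0 - lam) * batch[idx]
    mixed_targets = torch.clamp(targets + targets[idx], max=1.0)
    return mixed, mixed_targets
```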

Training stages:

  • Training proceeded in two stages: pre-training and fine-tuning.
  • In both stages, training started with CrossEntropyLoss and then switched to BCEWithLogitsLoss(reduction='sum').
  • To increase diversity, models were trained with different time windows and different Mixup rates, and some of them were trained only with CrossEntropyLoss. Additionally, 3 models were fine-tuned on 30-second time slices.
  • For each validation sample, the first 60 seconds were cut into segments, each segment was predicted, and the largest predicted probability was selected as the final prediction for the sample.

Inference:

  • For the SED models, 10-second audio clips were used for inference; the CNN models were applied only to the central 5 seconds, with max(framewise, dim=time) used to obtain the final prediction.
  • The SED models were also tested with a TTA (test-time augmentation, 2-second) technique.

Ensembling:

  • Weighted averaging of raw logits. Although the models' logits are on different scales and in principle should not simply be added, the absolute magnitude of the logits may carry information that contributes to the score, and this ensemble worked best on the public LB (leaderboard).
  • Converting logits to ranks and performing a weighted average of the ranks (a sketch follows).
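
A minimal sketch of rank-based ensembling, assuming each model produces an (n_samples, n_classes) logit matrix; the function and weighting scheme are illustrative:

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(logits_list: list, weights: list) -> np.ndarray:
    # Rank each model's scores within every class column, then take a
    # weighted average of the ranks; this discards the scale differences
    # between models' logits.
    ranks = [rankdata(logits, axis=0) for logits in logits_list]
    return np.average(np.stack(ranks), axis=0, weights=weights)
```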

4.3 Third place

https://www.kaggle.com/competitions/birdclef-2023/discussion/414102

Model architecture:

  • The contestants used a modified SED (Sound Event Detection) architecture that applies attention to frequency bands.
  • Different species in a soundscape often occupy different frequency bands. The original SED architecture aggregates the feature-map axis representing frequency via mean pooling and applies attention along the time axis. By rotating the Mel spectrogram 90 degrees before feeding it into the original SED network, the attention is instead applied across frequency bands, which helps distinguish species that vocalize simultaneously but at different pitches (see the sketch after this list).
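
A hedged sketch of the rotation trick in PyTorch; the tensor layout (batch, channel, mel bins, time frames) is an assumption about how the spectrograms are batched:

```python
import torch

def rotate_for_frequency_attention(mel: torch.Tensor) -> torch.Tensor:
    """(B, C, n_mels, n_frames) -> (B, C, n_frames, n_mels).
    A standard SED head mean-pools the third axis and attends over the
    fourth; after this swap it attends over frequency bands instead."""
    return mel.transpose(2, 3)
```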

Data preparation:

  • The 2021/2023 competition data was used, as well as datasets from xeno-canto, the BirdCLEF 2019 soundscapes, and the DCASE 2018 Bird Audio Detection Task.
  • Necessary conversions were applied to the files (for example, resampling to a 32 kHz sample rate).
  • The first 10 seconds of each xeno-canto file were added to the training set.

Model input:

  • Log Mel spectrograms of 5-second audio chunks are used as input to the model.
  • The parameters of the spectrogram are set as: n_fft=2048, hop_length=512, n_mels=128, fmin=40, fmax=15000, power=2.0, top_db=100.
  • Spectrograms are normalized to the range [0, 255] and converted to 3-channel RGB images (a sketch follows).
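
Under the assumption that the audio is already at 32 kHz, a minimal librosa sketch reproducing those spectrogram parameters (the min-max normalization is one plausible reading of "normalized to [0, 255]"):

```python
import numpy as np
import librosa

def make_input_image(audio: np.ndarray, sr: int = 32000) -> np.ndarray:
    # Log-mel spectrogram with the parameters listed above.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=2048, hop_length=512,
        n_mels=128, fmin=40, fmax=15000, power=2.0)
    log_mel = librosa.power_to_db(mel, top_db=100)
    # Min-max normalize to [0, 255] and replicate into 3 channels.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8) * 255.0
    return np.stack([img, img, img], axis=0).astype(np.uint8)
```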

Model backbone/encoder architecture:

  • Pre-trained models such as tf_efficientnet_b0_ns and tf_efficientnetv2_s_in21k are used as feature extractors.
  • All models use weights pre-trained on ImageNet.

Data Augmentation:

  • A variety of data augmentation methods are used, including random selection of audio chunks, pseudo-labelling, random shifting, filtering, time-domain mixing, gain, pitch shifting, time stretching, and noise.
  • To address the domain shift between training and test data, a reverb effect is added to simulate reflections in the soundscape (see the rough sketch after this list).
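
The write-up does not specify the reverb implementation; the following is a very rough stand-in that convolves the waveform with a sparse, decaying impulse response:

```python
import numpy as np

def add_reverb(audio: np.ndarray, sr: int = 32000,
               decay: float = 0.5, delay_s: float = 0.05) -> np.ndarray:
    # Build a sparse impulse response: 4 echoes, each `delay_s` apart
    # and `decay` times quieter than the previous one.
    ir = np.zeros(int(sr * delay_s * 3) + 1)
    for k in range(4):
        ir[int(sr * delay_s * k)] = decay ** k
    wet = np.convolve(audio, ir)[: len(audio)]
    return wet / (np.abs(wet).max() + 1e-8)  # renormalize to avoid clipping
```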

4.4 Fourth place

https://www.kaggle.com/competitions/birdclef-2023/discussion/412753

An overview of the fourth-place solution is as follows:

  • The solution adopts knowledge distillation.
  • Adding no-call data, xeno-canto data, and background audio (Zenodo) worked well.

The datasets used in the solution include:

  • The BirdCLEF 2023 dataset.
  • The BirdCLEF 2021/2022 datasets (for pre-training).
  • The ff1010bird_nocall dataset (5,755 files, used to learn the absence of bird calls).
  • Files from the xeno-canto dataset not included in the training set, under CC-BY-NC-SA (896 files) and CC-BY-NC-ND (5,212 files) licenses.
  • The Zenodo dataset (background noise).
  • The esc50 dataset (rain and frog sounds).
  • The aicrowd2020_noise_30sec dataset (background noise for pre-training).

The solution is based on the second-place solution of the BirdCLEF 2021 competition, using the timm library with eca_nfnet_l0 as the backbone network. In total, 4 slightly different model configurations were used.

The training process:

  • Knowledge distillation used precomputed predictions from the pre-trained [bird-vocalization-classifier](https://www.kaggle.com/models/google/bird-vocalization-classifier/frameworks/TensorFlow2/variations/bird-vocalization-classifier/versions/1) model (see the loss sketch after this list).
  • Training used 5-fold StratifiedKFold cross-validation; one fold did not cover all classes, so the remaining 4 folds were used for training.
  • The metric padded_cmap1 was used for training and validation; its results matched padded_cmap5 in most cases.
  • Only primary_label was used for prediction, with no-call represented as all zeros.
  • The loss function combined 0.1 × BCEWithLogitsLoss (on primary_label) and 0.9 × KLDivLoss (against the Kaggle model's predictions).
  • Training used randomly sampled 20-second audio clips; audio shorter than 20 seconds was repeated.
  • Training ran for 400 epochs, with most models converging within 100-300 epochs.
  • Early stopping was applied, halting after 10 epochs (pre-training) or 20 epochs (training) without improvement.
  • The optimizer was AdamW, with a learning rate of 5e-4 for pre-training and 2.5e-4 for training, and a weight decay of 1e-6.
  • CosineLRScheduler was used for learning-rate scheduling, with an initial cycle of 10, a warmup period of 1, 40 total epochs, a cycle decay of 1.0, and a minimum learning rate of 1e-7.
  • Mixup data augmentation was used, with p=1.0 working better than p=0.5.
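
A hedged sketch of that combined loss in PyTorch; the KL term assumes the teacher supplies a probability distribution over classes, and the exact reduction is an assumption:

```python
import torch.nn.functional as F

def distillation_loss(logits, primary_target, teacher_probs,
                      w_bce: float = 0.1, w_kd: float = 0.9):
    """0.1 * BCEWithLogitsLoss on the primary label plus
    0.9 * KLDivLoss against the teacher model's soft predictions."""
    bce = F.binary_cross_entropy_with_logits(logits, primary_target)
    kd = F.kl_div(F.log_softmax(logits, dim=-1),
                  teacher_probs, reduction="batchmean")
    return w_bce * bce + w_kd * kd
```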

The inference process:

  • The 4 models were ensembled by simple averaging.
  • The 4 models were compiled with PyTorch JIT for faster inference (a minimal sketch follows).
  • The inference process took a total of 110 minutes.
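
A minimal sketch of compiling a model with TorchScript for inference; the stand-in model and input shape are placeholders, not the team's actual networks:

```python
import torch

# Hypothetical stand-in model; the actual models were timm-based networks.
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 8, kernel_size=3),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
)
example = torch.randn(1, 1, 32000 * 5)  # 5 s of 32 kHz mono audio
scripted = torch.jit.trace(model.eval(), example)  # compile via tracing
scripted.save("model_jit.pt")  # reload later with torch.jit.load
```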

Methods the team tried include, but are not limited to:

  • Focal loss.
  • Splitting classes with few samples.
  • Backbone networks other than eca_nfnet_l0 and eca_nfnet_l1.
  • Other optimizers (Adan, Lion, Ranger21, Shampoo).
  • CMO (mixup worked better when combined with distillation).
  • CQT (slower, with worse results).
  • Changing the first stride to (1, 1); cross-validation improved, but inference was too slow, with no single model finishing within 2 hours.
  • Pre-training with Zenodo data (from before distillation was tried).
  • Distillation training from scratch (without ImageNet weights), which did not converge due to time constraints.
  • Integrating MelSpectrogram, PCEN, and CQT into the input channels in all combinations, with no synergy.

The solution's ablation study results were presented as a figure in the original discussion post.

4.5 Fifth place

https://www.kaggle.com/competitions/birdclef-2023/discussion/412903

An overview of the fifth-place solution is as follows:

The solution used the following datasets:

  • Competition data from 2023/2022/2021.
  • Additional Xeno-Canto data containing 2023 competition species in the foreground and background.
  • The ESC50 noise dataset.
  • No-call noise data from the 2021 competition data.

The solution uses the following models:

  • The SED architecture is used, the same as in the fourth-place solution.
  • Four different backbone networks are used: tf_efficientnet_b1_ns, tf_efficientnet_b2_ns, tf_efficientnet_b3_ns, tf_efficientnetv2_s_in21k.

The training process is as follows:

  • Model training was split into two steps:
    Pre-training on the 2022/2021 competition data, using only white noise (p=0.5) as augmentation.
    Fine-tuning on the 2023 data, with various data augmentation methods.

  • Training details:
    All models were trained on 5-second audio clips.
    4-fold stratified cross-validation was used.
    Both primary and secondary labels were used.
    BCEWithLogitsLoss was the loss function, with a per-sample weight set according to each recording's rating (see the sketch after this list).
    The optimizer was AdamW, with a learning rate of 5e-4 and a weight decay of 1e-3 (for most models).
    CosineLRScheduler was used for learning-rate scheduling.
    Training ran for 40 epochs, and the best score almost always came at the last epoch. In addition to the 4 fold models, a 5th model was trained on all available data; this full-fit version was consistently 0.002-0.003 better than a single-fold model on both the public and private LB.
    Some models were fine-tuned with soft or hard pseudo-labels.
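
A hedged sketch of rating-based sample weighting; the mapping from rating to weight is an assumption, since the write-up only states that weights follow the rating:

```python
import torch
import torch.nn.functional as F

def rating_weighted_bce(logits: torch.Tensor, targets: torch.Tensor,
                        rating: torch.Tensor) -> torch.Tensor:
    """logits/targets: (B, n_classes); rating: (B,) in [0.5, 5].
    Higher-rated (cleaner) recordings get larger weights."""
    weights = (rating / 5.0).clamp(min=0.1).unsqueeze(1)  # assumed mapping
    loss = F.binary_cross_entropy_with_logits(logits, targets,
                                              reduction="none")
    return (loss * weights).mean()
```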
