Open Access Paper
28 December 2022
Attention enhanced dynamic kernel convolution for TDNN-based speaker verification
Xiaofan Lang, Ya Li
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 1250605 (2022) https://doi.org/10.1117/12.2662523
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
Speaker embedding is a state-of-the-art front-end module used to extract discriminative speaker features for speaker-related tasks. The Time Delay Neural Network (TDNN) has been a classical network architecture since it was first applied to speaker-related tasks as the X-vector. In this paper, we propose new network structures based on the currently popular ECAPA-TDNN. We propose a dynamic kernel convolution module that adaptively extracts features from short-term and long-term contexts, thus achieving multi-scale receptive fields. We also apply three enhanced attention modules in place of the plain Squeeze-Excitation (SE) layer to realize more efficient information interaction across channels and time. The proposed architectures outperform the strongest baseline, with a best Equal Error Rate (EER) of 6.40% and a 6.32% reduction in parameters, and they also achieve better performance when speaker utterances are shortened.

1. INTRODUCTION

The target of an Automatic Speaker Verification (ASV) system is to verify whether the speaker of an unknown utterance is the claimed one by comparing the test utterance with the registered utterances. There are two main types of Speaker Verification (SV) tasks: text-dependent (TD) and text-independent (TI). The former requires speakers to enrol utterances with a predefined fixed text, while the latter verifies the speaker's identity from registered utterances without a fixed text. Text-independent SV therefore covers more application scenarios and is more challenging than text-dependent SV.

In recent years, deep learning methods have been widely used to extract a fixed-dimensional speaker representation, called a speaker embedding, from a given utterance [1-4]. Recently, ResNet architectures [5, 6] and Time Delay Neural Network (TDNN) architectures [7-9] have been frequently used in SV tasks. Among these, ECAPA-TDNN [9] and its variants [10-12] provide the current best results. ECAPA-TDNN applies a 1D Res2Net structure as its backbone blocks to process multi-scale speaker features and cut down the number of model parameters. It also applies a squeeze-excitation (SE) [13] layer after each Res2Block to obtain channel-wise weights that rescale the frame-level features. Benefiting from its multi-scale feature extraction structure and channel attention strategy, it obtains more discriminative features for SV tasks.

In this work, we propose a 1D dynamic kernel convolution (DKC) structure inspired by selective kernel attention [14] to adaptively capture features in short-term and long-term contexts. The proposed module can dynamically select the appropriate convolution kernel size according to channel-wise weights. Inspired by References [15-17], we replace the channel attention layer of the original SE module with a spatial pyramid attention (SPA) module, an efficient channel attention (ECA) module, or a convolutional block attention module (CBAM). Experiments under three evaluation protocols show that the proposed DKC module and the improved attention mechanisms outperform the baselines, and experiments on short utterances confirm this conclusion.

The remainder of this paper is organized as follows: Section 2 introduces the two baseline system architectures, ResNet and ECAPA-TDNN. Section 3 explains the proposed dynamic kernel convolution and the enhanced channel attention architectures in detail. Section 4 presents the detailed settings of our experiments. The complete results and analysis are given in Section 5. Finally, Section 6 summarizes the paper.

2. BASELINE SYSTEM ARCHITECTURES

In this section, we describe two distinct speaker verification architectures, both of which perform strongly on speaker verification tasks.

2.1 ResNet

The Thin ResNet-34 proposed in Reference [5] is the first baseline system. It reduces the number of channels in the convolution layers of the residual blocks, thereby cutting down the computational cost. The convolution layers process 2D features at the frame-level feature extraction stage. Attentive statistics pooling [4] is used as the temporal pooling strategy: the first-order and second-order statistics of the frame-level features are aggregated over the time dimension and concatenated to generate utterance-level features. See Reference [5] for more details about the topology.
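As a concrete reference, here is a minimal PyTorch sketch of attentive statistics pooling in the spirit of Reference [4]; the bottleneck width and layer layout are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation per channel."""

    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
            nn.Softmax(dim=2),            # normalize the weights over the time axis
        )

    def forward(self, x):                 # x: (batch, channels, time)
        alpha = self.attention(x)         # frame-level attention weights
        mean = torch.sum(alpha * x, dim=2)
        var = torch.sum(alpha * x ** 2, dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)
```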

2.2 ECAPA-TDNN

The second baseline system is ECAPA-TDNN, a strengthened version of the vanilla X-vector system [2, 3]. It utilizes the hierarchical residual structure of Res2Net [18] to capture multi-scale features and integrates the SE module into the residual blocks to rescale the frame-level features per channel. A channel attention method is used at the temporal pooling layer to generate different attention coefficients for different frames of each feature map. Multi-layer feature aggregation is used to build the input of the pooling layer: the output feature maps of the shallower and deeper SE-Res2Blocks are concatenated and passed through a dense layer.

3. PROPOSED SYSTEM ARCHITECTURES

3.1 1D dynamic kernel convolution

The dynamic kernel convolution (DKC) is a dynamic channel selection mechanism. It is a multi-branch convolutional module which can select the kernel size adaptively in order to capture features in short-term and long-term contexts. The complete structure of the dynamic kernel convolution consists of three parts: split, attention, and select. A two-branch case is depicted in Figure 1.

Figure 1. Dynamic kernel convolution (DKC) module.

At the split stage, for the input feature X ∈ ℝ^{C×T}, we conduct two transformations F_1: X → U_1 ∈ ℝ^{C×T} and F_2: X → U_2 ∈ ℝ^{C×T}, implemented as 1D convolution operators with kernel sizes k_1 and k_2, respectively. In the proposed model the two branches share the same kernel size: one branch uses a standard convolution while the other uses a dilated convolution. This design reduces the number of network parameters while achieving almost the same performance.

At the attention stage, we combine the multi-scale information from different convolution branches by an element-wise summation:

U = U_1 + U_2

The channel-wise mean μ ∈ ℝ^C and standard deviation σ ∈ ℝ^C of U are collected by a statistics pooling layer. Specifically, the c-th elements of μ and σ are calculated as follows:

μ_c = (1/T) Σ_{t=1}^{T} U_{c,t}

σ_c = √( (1/T) Σ_{t=1}^{T} (U_{c,t} − μ_c)² )

Taking the concatenation of μ and σ as input, we obtain a compact feature z ∈ ℝ^d after a simple fully connected (FC) layer, as follows:

z = δ(B(F_fc([μ; σ])))

where δ is the ReLU function, B is batch normalization, and the fully connected layer F_fc([μ; σ]) equals Wᵀ[μ; σ], with W ∈ ℝ^{2C×d} denoting the weight matrix. A compression ratio r is introduced to generate the squeezed dimension d = C / r. The softmax weight of the i-th convolution branch is then obtained from another fully connected layer:

s_i = τ(V_iᵀ z),  i = 1, 2

where τ is the softmax activation function, V_i ∈ ℝ^{d×C} is the weight matrix, and s_i ∈ ℝ^{C×1} denotes the channel-wise attention vector for the feature map U_i (i = 1, 2).

At the select stage, the c-th channel of the final dynamic representation Y ∈ ℝ^{C×T} is calculated as the weighted summation over the branches:

Y_c = s_{1,c} · U_{1,c} + s_{2,c} · U_{2,c}

where s_{i,c} represents the c-th element of s_i, and U_{i,c} represents the c-th row of U_i.
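To make the three stages concrete, the following is a minimal PyTorch sketch of the two-branch DKC module described above. The class name, layer choices and defaults (kernel size 3, dilation 2, reduction ratio 16) reflect our reading of the text and the settings in Section 4.2, not the authors' released code.

```python
import torch
import torch.nn as nn

class DynamicKernelConv1d(nn.Module):
    """Two-branch dynamic kernel convolution: split, attention, select."""

    def __init__(self, channels, kernel_size=3, dilation=2, reduction=16):
        super().__init__()
        pad = kernel_size // 2
        # Split: one standard and one dilated branch with the same kernel size.
        self.branch1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.branch2 = nn.Conv1d(channels, channels, kernel_size,
                                 padding=pad * dilation, dilation=dilation)
        d = channels // reduction                      # squeezed dimension d = C / r
        self.fc = nn.Sequential(                       # z = ReLU(BN(W^T [mu; sigma]))
            nn.Linear(2 * channels, d),
            nn.BatchNorm1d(d),
            nn.ReLU(),
        )
        self.fc_branches = nn.Linear(d, 2 * channels)  # one V_i per branch

    def forward(self, x):                              # x: (batch, C, T)
        u1, u2 = self.branch1(x), self.branch2(x)
        u = u1 + u2                                    # element-wise fusion
        mu, sigma = u.mean(dim=2), u.std(dim=2)        # statistics pooling
        z = self.fc(torch.cat([mu, sigma], dim=1))
        logits = self.fc_branches(z).view(-1, 2, u.size(1))
        s = torch.softmax(logits, dim=1)               # softmax across the two branches
        # Select: channel-wise weighted sum of the two branches.
        return s[:, 0].unsqueeze(-1) * u1 + s[:, 1].unsqueeze(-1) * u2
```

In our reading, this block stands in for the conventional dilated 1D convolution inside each SE-Res2Block of the ECAPA-TDNN baseline, as configured in Section 4.2.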

3.2 Enhanced attention mechanisms

Considering that a wider temporal context contains more speaker information, the SE module is used in both baseline systems to rescale the frame-level features using global properties of the utterance. Specifically, the frame-level features are compressed along the time axis by global average pooling, and channel-wise weights are produced by a multi-layer perceptron (MLP). Enhanced attention mechanisms can replace the SE module to further mine the contextual information of the features.
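For comparison with the replacements below, here is a minimal sketch of the plain SE layer on 1D (channel × time) features, assuming PyTorch; the reduction ratio of 16 is an illustrative choice.

```python
import torch
import torch.nn as nn

class SEModule1d(nn.Module):
    """Plain squeeze-excitation over the time axis (the layer being replaced)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, C, T)
        w = self.mlp(x.mean(dim=2))       # squeeze: global average over time
        return x * w.unsqueeze(-1)        # excite: rescale each channel
```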

Spatial pyramid attention (SPA) replaces the single-scale global average pooling layer in the SE module with a group of adaptive average pooling (AAP) layers of different output sizes. Such a spatial pyramid structure captures more spatial information from the input feature map.

For a given feature X ∈ ℝ^{C×T}, let w ∈ ℝ^{C×1} be the channel-wise attention weight vector and let X′ ∈ ℝ^{C×T} denote the output feature map after channel rescaling. The SPA module can be expressed by the following equations:

A_i = R(F_aap(X, s_i)),  i = 1, 2, 3

A = [A_1; A_2; A_3]

w = τ(F_fc(δ(F_fc(A))))

X′ = w ⊙ X

where F_aap(X, s_i) denotes the AAP layer with output size s_i. The pooled outputs are resized into three 1-dimensional vectors and concatenated to generate a 1-dimensional attention map A ∈ ℝ^{C′×1}, where R(·) denotes the resize function. δ(·), τ(·) and F_fc(·) represent the ReLU activation, the sigmoid activation and a fully connected layer, respectively, and ⊙ denotes channel-wise multiplication broadcast over the time axis.
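Below is a minimal 1D sketch of the SPA module consistent with the equations above, assuming PyTorch; the pooling sizes (1, 2, 4) and 128-dimensional bottleneck follow Section 4.2, while the class name and layer layout are ours.

```python
import torch
import torch.nn as nn

class SpatialPyramidAttention1d(nn.Module):
    """SPA for 1D features: multi-size adaptive average pooling feeds the MLP."""

    def __init__(self, channels, pool_sizes=(1, 2, 4), bottleneck=128):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool1d(s) for s in pool_sizes])
        in_dim = channels * sum(pool_sizes)        # C' after concatenation
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (batch, C, T)
        a = torch.cat([p(x).flatten(1) for p in self.pools], dim=1)
        w = self.mlp(a)                            # channel-wise weights
        return x * w.unsqueeze(-1)
```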

Efficient channel attention (ECA) utilizes a 1-dimensional convolution instead of an MLP to generate channel-wise attention weights. The convolution operator captures local cross-channel interaction with few parameters, which guarantees both effectiveness and efficiency. The ECA module can be expressed by the following equations:

w = τ(C_k(F_gap(X)))

X′ = w ⊙ X

where F_gap(·) denotes the global average pooling layer and C_k(·) represents a 1D convolution with kernel size k.
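A minimal 1D sketch of the ECA module matching the equations above, assuming PyTorch; the default kernel size of 5 follows Section 4.2.

```python
import torch
import torch.nn as nn

class ECAModule1d(nn.Module):
    """ECA: a single 1D convolution over the pooled channel descriptor."""

    def __init__(self, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                       # x: (batch, C, T)
        y = x.mean(dim=2, keepdim=True)         # global average pooling -> (B, C, 1)
        y = self.conv(y.transpose(1, 2))        # convolve across channels -> (B, 1, C)
        w = torch.sigmoid(y.transpose(1, 2))    # channel weights -> (B, C, 1)
        return x * w
```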

The convolutional block attention module (CBAM) is composed of a channel sub-module and a spatial sub-module. The channel sub-module generates channel-wise attention weights from the max-pooling and average-pooling outputs along the time axis; the spatial sub-module then generates time-wise attention weights by concatenating the two pooling outputs along the channel axis and forwarding them to a convolutional layer. CBAM thus takes both channel-wise and time-wise information interaction into consideration. It can be expressed by the following equations:

w_c = τ(F_mlp(P_avg(X, t)) + F_mlp(P_max(X, t)))

w_s = τ(C_k([P_avg(X ⊙ w_c, c); P_max(X ⊙ w_c, c)]))

X′ = (X ⊙ w_c) ⊙ w_s

where w_c ∈ ℝ^{C×1} and w_s ∈ ℝ^{T} denote the channel-wise and time-wise attention weight vectors, respectively. P_max(·, c) and P_max(·, t) represent max-pooling along the channel and time axes, respectively, and P_avg(·, c) and P_avg(·, t) represent the corresponding average-pooling operations. F_mlp(·) is the MLP module, which can be written as F_fc(δ(F_fc(·))). As above, ⊙ denotes broadcast multiplication along the corresponding axis.
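A minimal sketch of CBAM adapted to 1D (channel × time) features, assuming PyTorch; the reduction ratio of 4 reproduces the 128-dimensional bottleneck of Section 4.2 for 512-channel features, and the shared MLP follows the original CBAM design.

```python
import torch
import torch.nn as nn

class CBAM1d(nn.Module):
    """CBAM on (batch, C, T) features: channel attention then temporal attention."""

    def __init__(self, channels, reduction=4, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(               # shared MLP for both pooled vectors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (batch, C, T)
        # Channel sub-module: pool along time, combine avg- and max-pooled paths.
        w_c = torch.sigmoid(self.mlp(x.mean(dim=2)) + self.mlp(x.amax(dim=2)))
        x = x * w_c.unsqueeze(-1)
        # Spatial (temporal) sub-module: pool along channels, one small convolution.
        pooled = torch.stack([x.mean(dim=1), x.amax(dim=1)], dim=1)  # (B, 2, T)
        w_s = torch.sigmoid(self.conv(pooled))                       # (B, 1, T)
        return x * w_s
```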

4. EXPERIMENTS

4.1 Datasets

VoxCeleb [19, 20] is a commonly used dataset for speaker-related tasks. VoxCeleb1 [19] contains more than 150,000 utterances from 1,251 speakers, and VoxCeleb2 [20] contains more than 1 million utterances from 5,994 speakers. We train the models on a subset of VoxCeleb2 constructed by randomly selecting 10 utterances per speaker. We use the MUSAN [21] and RIR [22] datasets for online augmentation, with the same augmentation settings as Reference [23]. SpecAugment [24] is applied to the log Fbanks of the training samples with a frequency masking dimension of 8 and a temporal masking dimension of 10, as sketched below. All models are evaluated on VoxCeleb1.
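A hedged sketch of the masking step, assuming torchaudio's masking transforms and interpreting the masking dimensions above as maximum mask widths.

```python
import torch
import torchaudio

# Frequency/time masking applied to (batch, n_mels, time) log-Fbank features.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=10)

def spec_augment(fbank: torch.Tensor) -> torch.Tensor:
    return time_mask(freq_mask(fbank))
```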

4.2 Training settings

A two-second segment is extracted randomly from each training sample. The input features are 80-dimensional log Fbanks extracted with a Hamming window of 25 ms length and 10 ms shift.
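A possible front-end matching the window settings above, assuming torchaudio and a 16 kHz sampling rate (400-sample windows, 160-sample hops); the FFT size and log floor are illustrative choices not stated in the paper.

```python
import torch
import torchaudio

# 80-dimensional log Fbanks from a 25 ms Hamming window with a 10 ms shift.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=400, hop_length=160,
    n_mels=80, window_fn=torch.hamming_window)

def log_fbank(waveform: torch.Tensor) -> torch.Tensor:
    return torch.log(mel(waveform) + 1e-6)      # (channels, 80, frames)
```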

The proposed dynamic kernel convolution replaces the conventional 1D convolutions in each 1D SE-Res2Block of the ECAPA-TDNN baseline. Each DKC module has two branches: the first uses a convolution with a kernel size of 3, and the second uses a dilated convolution with a kernel size of 3 and a dilation factor of 2. The channel reduction ratio r is set to 16. The scale of each Res2Block is set to 8.

We replace the SE layer with SPA, ECA, and CBAM, respectively. In SPA, three AAP layers are used with output sizes of 1, 2, and 4, and the bottleneck dimension is set to 128. In ECA, the kernel size of the convolution layer is set to 5. In CBAM, the bottleneck dimension of the channel sub-module is set to 128, and the kernel size of the convolution layer is set to 7.

All models use attentive statistics pooling as the temporal pooling strategy to generate utterance-level features. All convolutions for frame-level feature extraction have 512 channels, and the final fully connected layer has 192 dimensions so that all models output speaker embeddings of the same size. We train all models with the AAM-softmax loss [25] using a margin of 0.2 and a scale of 30.
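A minimal sketch of the AAM-softmax (ArcFace) objective with the stated margin and scale, assuming PyTorch; the class name, initialization and class count (the 5,994 VoxCeleb2 speakers) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    """Additive angular margin softmax with margin m = 0.2 and scale s = 30."""

    def __init__(self, embed_dim=192, n_classes=5994, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # Add the angular margin only to the target-class logit.
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```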

4.3 Evaluation protocol

Performance is measured by the equal error rate (EER) and the minimum detection cost function (MinDCF) with P_target = 10⁻² and C_FA = C_FR = 1. Models are evaluated on VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H. For full utterances, we split each utterance into 5 segments and extract a speaker embedding for each segment. For each trial, we form a 5×5 score matrix of cosine similarities between all pairs of segments and average the scores to obtain the final similarity score. To test the robustness of the models on short utterances, we also take segments with durations of 4 s and 2 s for each trial and compute the cosine similarity directly as the final score.
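A small sketch of the full-utterance scoring rule described above, assuming PyTorch tensors holding the five per-segment embeddings on each side of a trial.

```python
import torch
import torch.nn.functional as F

def trial_score(enroll_embs: torch.Tensor, test_embs: torch.Tensor) -> float:
    """Average cosine similarity over all pairs of segment embeddings.

    enroll_embs, test_embs: (5, 192) tensors, one embedding per segment.
    """
    e = F.normalize(enroll_embs, dim=1)
    t = F.normalize(test_embs, dim=1)
    return (e @ t.T).mean().item()     # 5x5 score matrix, averaged
```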

4.4 Implementation details

We implement all models in the PyTorch framework and train them on three NVIDIA Quadro RTX 8000 GPUs, each with 48 GB of memory. All models are trained for 80 epochs. The initial learning rate is set to 0.001 and decays by 3% after every epoch. The network parameters are updated by the Adam optimizer [26]. The mini-batch size for training is 400.
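A minimal sketch of the optimization schedule, assuming PyTorch; ExponentialLR with gamma = 0.97 reproduces the 3% per-epoch decay, and the linear model here is only a stand-in for the speaker network.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 192)   # stand-in for the speaker embedding network

# Adam with an initial learning rate of 1e-3, decayed by 3% after each epoch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

for epoch in range(80):
    # ... one training epoch over mini-batches of 400 samples ...
    scheduler.step()
```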

5. RESULTS

5.1 Results on VoxCeleb1

Table 1 gives an overview of the verification performance of all implemented models on full utterances; Tables 2 and 3 show the corresponding results for 4 s and 2 s segments, respectively. DKC-TDNN with the ECA and CBAM modules has fewer parameters than the baseline systems. Most of the proposed systems clearly outperform the baselines on full utterances, which shows that adaptively capturing multi-scale features with dynamic convolution is beneficial and that the improved channel and spatial attention methods do bring performance gains in a TDNN-based architecture. DKC-TDNN with the SPA module, in particular, achieves an EER of 6.40% and a MinDCF of 0.375 on VoxCeleb1-O, a relative EER improvement of 17.84% and 1.99% over ResNet and ECAPA-TDNN, respectively. DKC-TDNN with ECA achieves a similar gain, while DKC-TDNN with CBAM shows the smallest improvement. Almost all proposed systems only slightly reduce, or even increase, MinDCF compared with the ECAPA-TDNN baseline. When the utterances are shortened to 4 s and 2 s, the proposed systems perform better than the two baselines in either EER or MinDCF. Finally, comparing the computing speeds of all proposed models with the two baselines suggests that the multi-branch structure reduces the inference speed of the network.

Table 1. Performance of models on VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H for full utterances.

Model               Params. (M)   VoxCeleb1-O         VoxCeleb1-E         VoxCeleb1-H
                                  EER (%)   MinDCF    EER (%)   MinDCF    EER (%)   MinDCF
ResNet              6.72          7.79      0.458     8.07      0.474     11.55     0.587
ECAPA-TDNN          6.65          6.53      0.374     6.79      0.396     9.89      0.502
DKC-TDNN (SPA)      7.73          6.40      0.375     6.69      0.392     9.82      0.504
DKC-TDNN (ECA)      6.23          6.41      0.378     6.71      0.397     9.89      0.505
DKC-TDNN (CBAM)     6.60          6.47      0.379     6.78      0.398     9.98      0.506

Table 2. Performance of models on VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H for 4 s segments.

Model               VoxCeleb1-O         VoxCeleb1-E         VoxCeleb1-H
                    EER (%)   MinDCF    EER (%)   MinDCF    EER (%)   MinDCF
ResNet              9.35      0.534     9.35      0.525     13.08     0.638
ECAPA-TDNN          7.99      0.435     7.90      0.445     11.33     0.558
DKC-TDNN (SPA)      7.69      0.439     7.81      0.441     11.32     0.559
DKC-TDNN (ECA)      7.56      0.433     7.80      0.445     11.33     0.563
DKC-TDNN (CBAM)     7.81      0.444     7.93      0.447     11.45     0.563

Table 3. Performance of models on VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H for 2 s segments.

Model               VoxCeleb1-O         VoxCeleb1-E         VoxCeleb1-H
                    EER (%)   MinDCF    EER (%)   MinDCF    EER (%)   MinDCF
ResNet              13.43     0.658     13.67     0.668     17.68     0.767
ECAPA-TDNN          12.36     0.628     12.72     0.624     16.57     0.731
DKC-TDNN (SPA)      12.16     0.610     12.54     0.620     16.50     0.722
DKC-TDNN (ECA)      12.00     0.612     12.51     0.621     16.59     0.734
DKC-TDNN (CBAM)     12.29     0.615     12.60     0.625     16.61     0.731

5.2 Ablation studies

We conduct a series of ablation studies to reveal the impact of each component of the proposed architectures on the final performance. Table 4 gives the results on VoxCeleb1-O. Experiment A.0 is the ECAPA-TDNN baseline. Experiment B.0 replaces the conventional convolution in each branch of the Res2Blocks of the baseline with the DKC module. Experiments A.1, A.2 and A.3 replace the SE module of the baseline with SPA, ECA and CBAM, respectively, and experiments B.1, B.2 and B.3 do the same on top of experiment B.0.

Table 4. Ablation study of attention enhanced DKC on VoxCeleb1-O.

       System                        EER (%)   MinDCF
A.0    Baseline (ECAPA-TDNN)         6.53      0.374
A.1    Baseline with SPA             6.44      0.370
A.2    Baseline with ECA             6.50      0.387
A.3    Baseline with CBAM            6.38      0.383
B.0    DKC without Channel Att.      6.49      0.386
B.1    DKC with SPA                  6.40      0.375
B.2    DKC with ECA                  6.41      0.378
B.3    DKC with CBAM                 6.47      0.379

The results of experiments A.0 and B.0 demonstrate the effectiveness of the DKC module. The benefit of all three enhanced attention mechanisms can also be evaluated by comparing experiments A.1, A.2 and A.3 with A.0. In particular, CBAM gives the lowest EER of 6.38% among the three attention enhanced methods; we suppose that the temporal information aggregated by the spatial attention sub-module of CBAM is useful for speaker feature extraction. Experiments B.1 and B.2 show that combining the DKC module with SPA or ECA achieves better results. However, comparing experiment B.3 with A.3, combining the DKC module with CBAM yields a worse result; the conflict between the two mechanisms needs further study.

6. CONCLUSION

In this paper, we introduce a dynamic kernel convolution and three enhanced channel attention methods for automatic speaker verification to achieve multi-scale receptive fields and more efficient information interaction in speaker feature extraction. Extensive experiments on three evaluation protocols of VoxCeleb1 demonstrate that the proposed architectures outperform the two baseline systems on both full and short-duration utterances. Our ablation study confirms the effectiveness of the proposed dynamic kernel convolution and the three attention mechanisms.

REFERENCES

[1] Variani, E., Lei, X., McDermott, E., Moreno, I. L. and Gonzalez-Dominguez, J., "Deep neural networks for small footprint text-dependent speaker verification," 2014 IEEE Inter. Conf. on Acoustics, 4052-4056 (2014).

[2] Snyder, D., Garcia-Romero, D., Povey, D. and Khudanpur, S., "Deep neural network embeddings for text-independent speaker verification," Interspeech, 999-1003 (2017). https://doi.org/10.21437/Interspeech.2017

[3] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. and Khudanpur, S., "X-vectors: Robust DNN embeddings for speaker recognition," 2018 IEEE Inter. Conf. on Acoustics, 5329-5333 (2018).

[4] Okabe, K., Koshinaka, T. and Shinoda, K., "Attentive statistics pooling for deep speaker embedding," Interspeech, 2252-2256 (2018).

[5] Chung, J. S., Huh, J., Mun, S., Lee, M., Heo, H. S., Choe, S., Ham, C., Jung, S., Lee, B. J. and Han, I., "In defence of metric learning for speaker recognition," Interspeech, (2020).

[6] Zhou, T., Zhao, Y. and Wu, J., "Resnext and res2net structures for speaker verification," 2021 IEEE Spoken Language Technology Workshop, 301-307 (2021). https://doi.org/10.1109/SLT48900.2021

[7] Snyder, D., Garcia-Romero, D., Sell, G., McCree, A., Povey, D. and Khudanpur, S., "Speaker recognition for multi-speaker conversations using x-vectors," 2019 IEEE Inter. Conf. on Acoustics, 5796-5800 (2019).

[8] Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M. and Khudanpur, S., "Semi-orthogonal low-rank matrix factorization for deep neural networks," Interspeech, 3743-3747 (2018).

[9] Desplanques, B., Thienpondt, J. and Demuynck, K., "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," Interspeech, 3830-3834 (2020).

[10] Thienpondt, J., Desplanques, B. and Demuynck, K., "Integrating frequency translational invariance in TDNNs and frequency positional information in 2D Res2Net to enhance speaker verification," Interspeech, 2302-2306 (2021).

[11] Liu, T., Das, R. K., Lee, K. A. and Li, H., "MFA: TDNN with multi-scale frequency-channel attention for text-independent speaker verification with short utterances," 2022 IEEE Inter. Conf. on Acoustics, (2022).

[12] Mun, S. H., Jung, J. W. and Kim, N. S., "Selective kernel attention for robust speaker verification," Interspeech, (2022).

[13] Hu, J., Shen, L. and Sun, G., "Squeeze-and-Excitation networks," 2018 IEEE Conf. on Computer Vision and Pattern Recognition, 7132-7141 (2018).

[14] Li, X., Wang, W., Hu, X. and Yang, J., "Selective kernel networks," 2019 IEEE Conf. on Computer Vision and Pattern Recognition, 510-519 (2019).

[15] Guo, J., Ma, X., Sansom, A., McGuire, M., Kalaani, A., Chen, Q., Tang, S., Yang, Q. and Fu, S., "Spanet: Spatial pyramid attention network for enhanced image recognition," 2020 IEEE Inter. Conf. on Multimedia and Expo, 1-6 (2020).

[16] Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W. and Hu, Q., "ECA-Net: Efficient channel attention for deep convolutional neural networks," 2020 IEEE Conf. on Computer Vision and Pattern Recognition, 11531-11539 (2020).

[17] Woo, S., Park, J., Lee, J. Y. and Kweon, I. S., "CBAM: Convolutional block attention module," 2018 European Conf. on Computer Vision, 3-19 (2018).

[18] Gao, S., Cheng, M. M., Zhao, K., Zhang, X., Yang, M. H. and Torr, P. H. S., "Res2net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 652-662 (2019).

[19] Nagrani, A., Chung, J. S. and Zisserman, A., "Voxceleb: A large-scale speaker identification dataset," Interspeech, 2616-2620 (2017). https://doi.org/10.21437/Interspeech.2017

[20] Chung, J. S., Nagrani, A. and Zisserman, A., "Voxceleb2: Deep speaker recognition," Interspeech, 1086-1090 (2018).

[21] Snyder, D., Chen, G. and Povey, D., "Musan: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, (2015).

[22] Ko, T., Peddinti, V., Povey, D., Seltzer, M. L. and Khudanpur, S., "A study on data augmentation of reverberant speech for robust speech recognition," 2015 IEEE Inter. Conf. on Acoustics, 5220-5224 (2015).

[23] Das, R. K., Tao, R. and Li, H., "HLT-NUS submission for 2020 NIST conversational telephone speech SRE," arXiv preprint arXiv:2111.06671, (2021).

[24] Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D. and Le, Q. V., "Specaugment: A simple data augmentation method for automatic speech recognition," Interspeech, 2613-2617 (2019).

[25] Deng, J., Guo, J., Xue, N. and Zafeiriou, S., "Arcface: Additive angular margin loss for deep face recognition," 2019 IEEE Conf. on Computer Vision and Pattern Recognition, 4690-4699 (2019).

[26] Kingma, D. and Ba, J., "Adam: A method for stochastic optimization," 2015 Inter. Conf. on Learning Representations, (2015).