| Network layer | Output dimension (None is the number of audio clips in a batch) | Number of parameters |
| --- | --- | --- |
| Input_1 (input layer) | (None,1,160000) | 0 |
| Melspectrogram_1 (Mel spectrogram) | (None,64,625,1) | 0 |
| Batch_normalization_1 (batch normalization) | (None,64,625,1) | 4 |
| Conv2d_1 (convolutional layer) | (None,64,625,21) | 210 |
| Batch_normalization_2 (batch normalization) | (None,64,625,21) | 84 |
| Activation_1 (ReLU) | (None,64,625,21) | 0 |
| Max_pooling2d_1 | (None,32,313,21) | 0 |
| Conv2d_2 (convolutional layer) | (None,32,313,21) | 3990 |
| Batch_normalization_3 (batch normalization) | (None,32,313,21) | 84 |
| Activation_2 (ReLU) | (None,32,313,21) | 0 |
| Max_pooling2d_2 | (None,16,157,21) | 0 |
| Conv2d_3 (convolutional layer) | (None,16,157,21) | 3990 |
| Batch_normalization_4 | (None,16,157,21) | 84 |
| Activation_3 | (None,16,157,21) | 0 |
| Max_pooling2d_3 | (None,8,79,21) | 0 |
| Conv2d_4 (convolutional layer) | (None,8,79,21) | 3990 |
| Batch_normalization_5 | (None,8,79,21) | 84 |
| Activation_4 (ReLU) | (None,8,79,21) | 0 |
| Max_pooling2d_4 | (None,4,40,21) | 0 |
| Reshape (reduce dimension) | (None,160,21) | 0 |
| Gru1 (GRU) | (None,160,41) | 7749 |
| Gru2 (GRU) | (None,41) | 10209 |
| Dropout | (None,41) | 0 |
| Dense_1 (Softmax) | (None,50) | 2100 |
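The parameter counts and pooled shapes in the table can be cross-checked with plain arithmetic. The sketch below is not the authors' code. It assumes details the table does not state: 3x3 convolution kernels, 2x2 max pooling with "same" padding (sizes rounded up), batch normalization counted as four values per channel (scale, offset, moving mean, moving variance), and a GRU with a single bias vector per gate. The helper names are hypothetical. Under these assumptions the formulas reproduce every figure listed.

```python
import math

# All hyperparameters below (3x3 kernels, 2x2 "same" pooling, single-bias GRU)
# are assumptions chosen because they reproduce the table; the table itself
# does not state them.

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One kernel per (input channel, filter) pair plus one bias per filter.
    return kernel_h * kernel_w * in_channels * filters + filters

def batchnorm_params(channels):
    # Scale, offset, moving mean and moving variance for each channel.
    return 4 * channels

def gru_params(input_dim, units):
    # Three gates, each with input weights, recurrent weights and one bias.
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

def pool_same(size, pool=2):
    # "same" padding rounds the pooled size up.
    return math.ceil(size / pool)

counts = {
    "Batch_normalization_1": batchnorm_params(1),
    "Conv2d_1": conv2d_params(3, 3, 1, 21),
    "Batch_normalization_2": batchnorm_params(21),
    "Conv2d_2": conv2d_params(3, 3, 21, 21),
    "Batch_normalization_3": batchnorm_params(21),
    "Conv2d_3": conv2d_params(3, 3, 21, 21),
    "Batch_normalization_4": batchnorm_params(21),
    "Conv2d_4": conv2d_params(3, 3, 21, 21),
    "Batch_normalization_5": batchnorm_params(21),
    "Gru1": gru_params(21, 41),
    "Gru2": gru_params(41, 41),
    "Dense_1": dense_params(41, 50),
}
total = sum(counts.values())

# Follow the (mel bins, frames) shape through the four pooling layers.
shape = (64, 625)
shapes = []
for _ in range(4):
    shape = (pool_same(shape[0]), pool_same(shape[1]))
    shapes.append(shape)

for name, value in counts.items():
    print(f"{name}: {value}")
print("total:", total)
print("pooled shapes:", shapes)
```

The final pooled shape of 4 x 40 gives the 160 positions in the Reshape row, and the 21 channels become the per-step input size of the first GRU.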