Robust Voice Activity Detection with Deep Maxout Neural Networks

Voice activity detection (VAD) under non-stationary noise is an important task for real-life automatic speech recognition systems, especially when a remote microphone is used. Many existing methods do not work well with noise that changes over time or with a very low signal-to-noise ratio (SNR). This paper proposes a method based on deep maxout neural networks with dropout regularization. The method is effective even for very low SNR (down to -5 dB). The robustness of the method is demonstrated by low FR/FA error rates on a test dataset that was recorded under conditions different from those of the training dataset.


Introduction
Noise robustness of a speaker activity detector is a very important requirement for real-life use. This is especially evident when a remote microphone is used. Noise introduces substantial distortions into the speech signal. If the system is trained on clean data and applied to noisy data, the accuracy of detecting the boundaries of speaker activity drops significantly. Most existing approaches require the statistical characteristics of the noise to be known beforehand. These methods can be divided into three categories: approaches based on a deterministic rule, statistical approaches and neural network-based approaches.
Approaches based on a deterministic rule use a number of characteristics, such as zero crossing rate, short-time energy and autocorrelation coefficients, and compare these acoustic features with a preset threshold to make a decision. In (Rabiner & Sambur, 1975) two acoustic features (log energy and zero crossing rate) are used to detect the boundaries of isolated words. This algorithm is very simple, but it does not work in noisy conditions. The authors of (Savoji, 1989) first calculate the probability density function for the spectrum of each frame and then its entropy. They obtain speech and pause labels by using preset thresholds. This method does work for a noisy signal, but only for slowly changing noise levels, and it is not stable under low SNR. In (Krubsack & Niederjohn, 1991) a decision rule based on pitch detection is used. A speech confidence measure is determined using a heuristic procedure based on three features extracted from the autocorrelation function. In (Junqua et al., 1994) the method for detecting word boundaries is based on a time-frequency parameter which is formed from the energy in the frequency band and the log of short-time energy. The noise threshold is first calculated from several initial frames of the input signal and then compared to the time-frequency parameter in order to determine the initial boundaries. Then the threshold rule is used to determine the initial and final boundaries of words. The main drawback of approaches based on deterministic rules is that they use thresholds extracted empirically from a segment of non-speech signal. Consequently, such methods do not work when noise levels change over time. They are also not effective for low SNRs.
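To make the threshold idea concrete, the following minimal sketch labels frames by comparing their log energy with a threshold estimated from a few leading non-speech frames. It is not the algorithm of any of the cited papers: the function names, the `margin` value and the use of log energy alone (without zero crossing rate) are assumptions made for illustration.

```python
import math

def log_energy(frame):
    """Log of the mean squared amplitude of one frame of samples."""
    energy = sum(sample * sample for sample in frame) / len(frame)
    return math.log(energy + 1e-12)  # small constant avoids log(0)

def estimate_threshold(noise_frames, margin=3.0):
    """Threshold = mean log energy of known non-speech frames + margin.

    `margin` is a hypothetical tuning constant, not a value from the paper.
    """
    energies = [log_energy(frame) for frame in noise_frames]
    return sum(energies) / len(energies) + margin

def energy_vad(frames, threshold):
    """Label each frame as speech (1) or pause (0) with a fixed threshold."""
    return [1 if log_energy(frame) > threshold else 0 for frame in frames]
```

Because the threshold is fixed once from the initial frames, any later rise in the noise level pushes pause frames above it, which is the weakness of deterministic rules noted above.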
Statistical approaches (employing hidden Markov models (HMM) or Gaussian mixture models (GMM)) use maximum a posteriori probability (MAP) or maximum likelihood (ML) criteria for speech detection. It is assumed that a feature vector belongs to a certain class. Different clustering methods are used for solving the task. However, a large amount of training data for different types of background noise is needed for estimating the probability distribution. The quality of these approaches depends on the choice of probability distribution and on the possibility of estimating the parameters of the noise distribution. (Atal & Rabiner, 1976) solved the problem of speech detection for clean speech using an approach based on pattern recognition. Five acoustic features were used: zero crossing rate, short-time energy, the first coefficient of the autocorrelation function, the first-order linear prediction coefficient and the residual energy of linear prediction. The model for each class was a multidimensional Gaussian distribution. The MAP criterion was used for making the decision. In (Acero et al., 1993) an HMM was used for modeling speech and pause classes, and the Viterbi algorithm was used for searching. (Bhiksha & Rita, 2003) describes using non-linear likelihood obtained from a Bayesian classifier. The main drawback of statistical methods is that the distribution of acoustic features for each class must be known beforehand. (Wu and Zhang, 2011) proposed using a linear weighted combination of different statistical models as the input of an unsupervised SVM.
Neural network-based approaches use neural networks as template classifiers. There are two advantages in using such an approach. The first is that a neural network classifier is built directly on the training data without a strict assumption about the distribution of its classes. The second is the high discrimination capability of neural networks. (Qi & Hunt, 1993) proposes a multilayer neural network for detecting voiced and non-voiced fragments and pauses. Several features are used: cepstral coefficients, zero crossing rate and mean square energy. However, this approach completely ignores context information. In (Hong & Lee, 2013) an RNN is used for classifying speech and non-speech fragments under noisy conditions. The authors demonstrate the advantage of an RNN classifier over a GMM classifier. They describe the efficiency of the method under changing noise levels; however, they only deal with different automobile noises. The paper (Zhang & Wu, 2013) proposes a deep belief network (DBN)-based VAD. A DBN is a powerful hierarchical generative model for feature extraction. Unlike traditional methods of training deep models, a DBN can prevent overfitting by using a special unsupervised pretraining procedure. A DBN-based VAD first concatenates acoustic features into a long feature vector, which is used as the visible layer, or input, of the DBN. Then a new feature is extracted by passing the long feature vector through multiple nonlinear hidden layers. Finally, the class of each observation is predicted by a linear classifier, so the output is the softmax layer of the DBN with the new feature as its input.
Deep neural networks have a long history. They can describe a highly variable function using relatively few parameters. If the training is completed successfully, they can achieve good generalization capability even with a small volume of training data.
We propose using a deep neural network with a maxout activation function and dropout regularization. Dropout has shown its efficiency on small training data. Using maxout improves the accuracy of model averaging with dropout. The trained neural networks are highly effective for noisy data, even under low SNR and when the training and test data are mismatched. In the proposed detector, the extracted acoustic features are fed into a trained deep maxout neural network. The outputs of the network are the a posteriori probabilities of each frame belonging to one of the classes (speech, pause). To label speech segments and pause segments correctly, a threshold is applied to the a posteriori probabilities. The threshold value may be selected automatically depending on the SNR. The threshold increases for larger values of SNR, because at high SNR the a posteriori probabilities of speech segments are close to 1 and the VAD can separate speech from pause with greater certainty. At low SNR, the a posteriori probabilities of speech segments may decline to 0.6. In any case, the threshold does not fall below 0.55 in our experiments, since it is important to identify all the speech. The last stage is smoothing the frame labeling. By default, fragments shorter than 1 second are smoothed.
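The post-processing described above (an SNR-dependent threshold followed by smoothing of short fragments) can be sketched as follows. The paper does not give the mapping from SNR to threshold or the exact smoothing rule, so the linear mapping, the `ceiling`, `snr_low` and `snr_high` values, and the rule of relabeling a short run with the label of the preceding run are all assumptions. Only the floor of 0.55 comes from the text.

```python
def select_threshold(snr_db, floor=0.55, ceiling=0.9,
                     snr_low=-5.0, snr_high=20.0):
    """Hypothetical linear SNR-to-threshold mapping, clipped to [floor, ceiling]."""
    if snr_db <= snr_low:
        return floor
    if snr_db >= snr_high:
        return ceiling
    fraction = (snr_db - snr_low) / (snr_high - snr_low)
    return floor + fraction * (ceiling - floor)

def label_frames(speech_posteriors, threshold):
    """Speech (1) if the posterior of the speech class reaches the threshold."""
    return [1 if p >= threshold else 0 for p in speech_posteriors]

def smooth_labels(labels, min_run):
    """Relabel every run shorter than min_run frames with the preceding label."""
    smoothed = list(labels)
    start = 0
    while start < len(smoothed):
        end = start
        while end < len(smoothed) and smoothed[end] == smoothed[start]:
            end += 1
        if end - start < min_run and start > 0:
            for k in range(start, end):
                smoothed[k] = smoothed[start - 1]
        start = end
    return smoothed
```

With a frame shift of 10 ms (also an assumption), the default of 1 second would correspond to `min_run=100`.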

Training Features
The choice of training features is critical for any classification task. For a speaker activity detector, good features must satisfy two conditions: 1) the distributions of speech and non-speech fragments must be different, that is, good features must not overlap for speech and noise classes; 2) the features must be robust to noise.
We examined and compared the following features: mel-frequency cepstral coefficients (MFCC) (Kinnunen et al., 2007) with context, filter banks (Fbank) with context and with cepstral mean normalization, and gammatone frequency cepstral coefficients (GFCC) (Shao et al., 2009) with context. The advantage of GFCC features compared to the others is that they are more robust to noise, so they work better for speech detection.
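The two feature post-processing steps mentioned here, cepstral mean normalization and adding context, can be illustrated with a short sketch. It assumes that normalization subtracts the per-coefficient mean over the utterance and that context is added by concatenating neighboring frames symmetrically, repeating edge frames. The paper does not specify either detail, and the function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]

def splice_context(frames, context):
    """Concatenate each frame with `context` neighbors on each side.

    Frames beyond the utterance edges are replaced by the nearest edge frame.
    """
    count = len(frames)
    spliced = []
    for t in range(count):
        window = []
        for offset in range(-context, context + 1):
            index = min(max(t + offset, 0), count - 1)
            window.extend(frames[index])
        spliced.append(window)
    return spliced
```

Each spliced vector has `(2 * context + 1)` times the original dimension, which is what the network sees as its input.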

The Training Network
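The output of the first hidden layer $H^1$ is the vector $Y^1$, which is calculated as a matrix multiplication between the input vector of the layer and the weight matrix, followed by the activation function $A$. As a result, $Y^1_j = A\left(\sum_{i=1}^{n} W^1_{ij} x_i + b^1_j\right)$, where $x_i$ are the elements of the input vector, $W^1_{ij}$ are the elements of the weight matrix of the first hidden layer, $b^1_j$ are the corresponding offsets, $i = 1, \dots, n$, $j = 1, \dots, d$.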
After that, the neurons in the network are united into groups, each of which constitutes a maxout node. The number of groups in the experiment is 5. As the activation function $A$ we use the maximum selected from the several candidates of the maxout node. (Goodfellow et al., 2013) shows the advantage of the maxout network over networks with differentiable activation functions such as tanh (hyperbolic tangent): it gives a better approximation of model averaging. The maxout activation function is represented as $A = \max_{r = 1, \dots, R} z^1_r$, where $z^1_r$ is the value of the $r$-th neuron in the maxout node, $r = 1, \dots, R$, and $R$ is the number of neurons in the maxout node.
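The forward pass of one maxout layer can be sketched in a few lines: an affine transform of the input, followed by taking the maximum within each group of neurons. This is an illustration under assumptions, not the authors' implementation. In particular, grouping consecutive neurons and the names `affine` and `maxout` are choices made here.

```python
def affine(x, weights, biases):
    """Compute z[j] = sum_i x[i] * weights[i][j] + biases[j]."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
        for j in range(len(biases))
    ]

def maxout(z, group_size):
    """Return the maximum of each group of `group_size` consecutive neurons."""
    return [max(z[k:k + group_size]) for k in range(0, len(z), group_size)]

# Tiny usage example: 2 inputs, 4 linear neurons, 2 maxout nodes of size 2.
x = [1.0, 2.0]
weights = [[1.0, 0.0, -1.0, 2.0],
           [0.0, 1.0, 1.0, -1.0]]
biases = [0.0, 0.0, 0.5, 0.0]
hidden = maxout(affine(x, weights, biases), group_size=2)
```

Since each node outputs the largest of several linear functions of the input, the shape of the activation is learned through the weights and is not fixed in advance.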
Dropout is used at the output of the hidden layer. Most of the literature on deep learning focuses mainly on regularizing the network so as to avoid overfitting. Different regularization methods exist (L1, L2 (Bengio, 2012), L2-prior regularization (Liao, 2013)). Dropout (Hinton, 2012), (Wang & JaJa, 2014) is a widely used and effective regularization method for DNNs. It makes it possible to avoid complicated co-adaptations on the training data. In addition, the dropout procedure is an efficient way of averaging models with neural networks. A good way to reduce error on the test set is to average the predictions obtained from a very large number of different networks.
The standard solution is to train many separate networks and then apply each of them to the test data, but this process is very labor-intensive both for training and for testing. Randomly assigning a zero value to a neuron makes it possible to train a large number of different neural networks in reasonable time. In effect, a different network is trained for each training vector, but all networks share a common weight matrix. At the output, each neuron of the layer is assigned a zero value with probability $1 - p$. Experiments show that dropout increases the generalization capability of the neural network and improves results on test data. Dropout is also efficient for small training datasets. Combined with maxout, dropout makes it possible to achieve exact rather than approximate model averaging and to fully utilize its potential. The use of dropout is illustrated in Figure 3. Using dropout at the output of the layer we get $\tilde{Y}^1 = M \odot Y^1$, where $M$ is a binary mask vector of dimension $d$, $M_j \sim \mathrm{Bernoulli}(p)$, $j = 1, \dots, d$. At the output of the classifier network we use a softmax layer, which normalizes the output values so that they sum to 1 and makes it possible to interpret the outputs of the neural network as a posteriori probabilities: $p_c = \exp(W_c) / \sum_{l=1}^{k} \exp(W_l)$, $c = 1, \dots, k$, where $W$ is the parameter vector of dimension $k$ and $k$ is the number of classes (in our case, 2: speech and pause).
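As an illustration of the two operations described in this paragraph, the sketch below applies a random binary mask to a layer output and normalizes a score vector with softmax. It assumes that each neuron is kept with probability `keep_prob` (the $p$ of the text). Any rescaling of activations at test time is left out because the paper does not describe it, and the function names are hypothetical.

```python
import math
import random

def dropout(y, keep_prob, rng):
    """Zero each element of y independently with probability 1 - keep_prob.

    Returns the masked output and the binary mask, which is needed again
    in the backward pass.
    """
    mask = [1 if rng.random() < keep_prob else 0 for _ in y]
    return [m * value for m, value in zip(mask, y)], mask

def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
masked, mask = dropout([0.3, 1.2, 0.7, 2.0], keep_prob=0.5, rng=rng)
posteriors = softmax([1.5, -0.5])  # two classes: speech and pause
```

A fresh mask is drawn for every training vector, so each vector effectively trains a different thinned network that shares weights with all the others.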

Training Conditions
According to the target function, we calculate the value of the training error $E$. We use the cross-entropy criterion as the target function (Golik, 2013). At the final training stage we calculate the increments $\Delta W$ for the weights of each neuron for their subsequent updating. We use standard backpropagation for that (Rojas, 1996). Weight increments are calculated starting from the softmax layer and ending with the first layer. We introduce two differences from the standard backpropagation procedure. Firstly, the increments for the dropout layer are calculated so that only the weights that were active during the forward pass are updated, according to the mask $M$. Secondly, a similar mask is used for the maxout layer, that is, only the weights corresponding to maximum values are updated. The increments for the weights of the first hidden layer $H^1$ are determined in the same way.
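The two modifications to backpropagation can be sketched as gradient-routing functions. The sketch assumes a one-hot target, for which the cross-entropy reduces to the negative log probability of the correct class and the gradient with respect to the softmax inputs is the difference between the output probabilities and the target. Learning rate, weight updates and the function names are not taken from the paper.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy error for a one-hot target."""
    return -math.log(probs[target_index])

def softmax_input_gradient(probs, target_index):
    """Gradient of the cross-entropy with respect to the softmax inputs."""
    return [p - (1.0 if k == target_index else 0.0)
            for k, p in enumerate(probs)]

def dropout_backward(grad_output, mask):
    """Pass gradient only through neurons that were active in the forward pass."""
    return [g * m for g, m in zip(grad_output, mask)]

def maxout_backward(z, grad_output, group_size):
    """Route each node's gradient to the neuron that produced the maximum."""
    grad_z = [0.0] * len(z)
    for node, grad in enumerate(grad_output):
        start = node * group_size
        group = z[start:start + group_size]
        winner = start + group.index(max(group))
        grad_z[winner] = grad
    return grad_z
```

Neurons that were dropped or that did not win the maximum receive zero gradient, so their incoming weights stay unchanged for that training vector.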

Training and Testing Datasets
The training and testing data for the neural networks were taken from the speaker database recorded at Speech Technology Center (STC). The database contains recordings of phonetically rich sentences made with a remote omnidirectional microphone under various acoustic conditions (office, home, car, street). The microphone was located at a distance of 2 to 3 m from the speaker, with a 0.5 m error. The experimental dataset is described in Table 1.

Experimental Results
To test the robustness of VAD with a maxout DNN, we trained several DNNs with different features. All the networks had two hidden 1000-dimensional layers.
The fbankCMN_2HLx1000_L2.net DNN was trained using Fbank features with context length 15 and cepstral mean normalization. L2 regularization was used during training, and the network configuration was fully connected.
The mfccCMN_2HLx1000_L2.net DNN was trained using MFCC features; in other respects it was similar to the previous network.
The gammatoneNet_2HLx1000_L2.net DNN was trained using gammatone features with context length 15. L2 regularization was used during training, and the network configuration was also fully connected.
The final network, maxoutCMN_2HLx1000.net, was trained using Fbank features with context length 15 and cepstral mean normalization. We used dropout regularization for training and the maxout activation function; the network configuration was described above in Section 4.
The test data were remote microphone recordings with a total duration of 3 hours, containing different types of office and home noises, as well as street, automobile and construction noises. The test data did not match the training data: a different type of remote microphone was used, and the distance was not the same (the microphone could be more than 3 m away from the speaker).
Table 2 shows the results of FR/FA, where FR is the error "speech as pause" and FA is "pause as speech".
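For reference, the two error types can be computed from frame-level labels as in the sketch below. The paper does not state how the rates are normalized, so dividing FR by the number of reference speech frames and FA by the number of reference pause frames is an assumption, as is the function name.

```python
def fr_fa_rates(reference, hypothesis):
    """Return (FR, FA) for frame labels, where 1 = speech and 0 = pause.

    FR: fraction of reference speech frames labeled as pause.
    FA: fraction of reference pause frames labeled as speech.
    """
    speech_total = sum(1 for r in reference if r == 1)
    pause_total = len(reference) - speech_total
    missed = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    false_speech = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)
    fr = missed / speech_total if speech_total else 0.0
    fa = false_speech / pause_total if pause_total else 0.0
    return fr, fa
```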

Discussion
The table shows that using the maxout activation function in combination with dropout regularization reduces both "speech as pause" and "pause as speech" errors by a factor of four. The reasons for this result are as follows. First, maxout does not use a fixed activation function; instead, the function is created during training. Second, maxout is a universal approximator: any continuous function can be approximated arbitrarily well on a compact domain by a maxout network with two maxout hidden units. Third, dropout performs model averaging, so maxout in conjunction with dropout enhances the accuracy of the dropout model averaging technique and improves optimization. The maxout model benefits more from dropout than other activation functions do. Using this approach, we can achieve greater robustness in noisy conditions.
This paper presents the results of the first experiments with a DNN with maxout activation function and dropout regularization on noisy features. In the future, we are particularly interested in the following topics. We plan to perform experiments on the selection of dropout regularization parameters and the size of maxout groups.
An increased number of hidden layers may improve the final results. We would also like to conduct more comprehensive experiments on different ranges of SNR. The described DNN with maxout activation function and dropout regularization was trained on noisy Fbank features. We plan to use other features with different context lengths. Gammatone features may give the best results on very noisy data.

Conclusion
The paper presents a robust speaker activity detector based on DNNs with maxout activation function and dropout regularization. It is well known that the main problem for DNN training is often an insufficient amount of training data. Our experiments show that the proposed method detects speech and non-speech efficiently even when the training and test data are mismatched. The effectiveness of the method is demonstrated by lower error rates compared to standard DNNs under low SNR (down to -5 dB).
Further research will focus on the use of more robust features with different context lengths and on different SNR ranges. Experiments are planned to select optimal parameters of DNN training (maxout group size, dropout regularization parameters, number of hidden layers).

Figure 1. The structure of the speaker activity detector.

Figure 2. The neural network for VAD with two hidden layers $H^1$ and $H^2$.
A fully connected layer with dropout becomes a sparse layer in which the values of neurons are updated randomly during training. Each element of the mask is independent for each training feature vector and in effect establishes different connections for each new feature vector from the training dataset. In addition, the mask is also applied to the offsets during training. As with $Y^1$, at the output of the second hidden layer we get the vector $Y^2$.

Table 2. Comparison of FR/FA for different neural networks