An Adaptive Methodology for a Ubiquitous ASR System

Achieving and maintaining the performance of a ubiquitous Automatic Speech Recognition (ASR) system is a real challenge. The main objective of this work is to develop a method that improves, and demonstrates consistency in, the performance of a ubiquitous ASR system in real-world noisy environments. An adaptive methodology has been developed to achieve this objective by implementing the following: cleaning the speech signal as far as possible while preserving its originality/intelligibility using various modified filters and enhancement techniques; extracting features from speech signals using various parameter sizes; training the system for ubiquitous environments using multi-environment adaptation training methods; and optimizing the word recognition rate with appropriate variable parameter sizes using a fuzzy technique. Consistency in performance is tested using standard noise databases as well as in real-world environments, and a good improvement is observed. This work will be helpful for the discriminative training of ubiquitous ASR systems for better Human-Computer Interaction (HCI) using a Speech User Interface (SUI).


Introduction
A Speech User Interface (SUI) is a logical choice for man-machine communication; hence the growing interest in developing machines that accept speech as input. Speech-operated applications in noisy environments are in demand, and they are also very helpful to society for easy Human-Computer Interaction.
However, a number of hurdles remain before these technologies become ubiquitous. In light of an increasingly mobile and socially connected population, core challenges include robustness to additive background noise, convolutional channel noise, room reverberation, and microphone mismatch (IEEE Signal Processing Magazine, 2012). This so-called robustness problem not only leads to a significant degradation in performance but also hampers the fast commercialization of speech recognition applications.
Speech recognition systems give better results when tested in conditions similar to those used to train the acoustic models. In the case of real-world environmental noise, it is very difficult to predict the noisy environment in advance, and hence difficult to achieve environmental robustness.
Experimental results show that no single method is available that will both clean speech corrupted by real, natural environmental (mixed) noise and preserve its quality. It is also observed that performance depends on the parameters used while extracting speech signal features, such as the window size, frame size, and frame overlap (Shrawankar & Thakare, 2012a). Adaptation is a technique that helps current recognition systems solve this problem. The adaptive method presented in this paper uses variable parameter sizes (window size, frame size, and frame overlap percentage) and various categories and levels of noise to train the system. A method was developed (Shrawankar & Thakare, 2012b; Shrawankar & Thakare, 2010a) to clean noisy signals and enhance them using two categories of techniques, traditional noise filters and modified speech signal enhancement algorithms, considering all combinations of enhancement techniques from three classes: Spectral Subtraction (Zhu, 2003), Subspace Filtering (Ephraim & Trees, 1995), and Statistical Filters (Lu & Loizou, 2010), independently as well as in combination.
A methodology is further developed for training on all categories of noise; it adapts the acoustic models to a new environment, which helps to improve as well as maintain the performance of the speech recognizer under real-world environmental mismatch. Training is done using Hidden Markov Models (HMMs) (Sameti et al., 1998).
The analysis of performance is done using conventional as well as different objective (Ma et al., 2011; Ma et al., 2009; Hu et al., 2008) and subjective (Hu et al., 2006; Etame et al., 2011) measures that can predict the overall speech quality and the speech/noise distortions introduced by representative speech enhancement algorithms from various classes (Shrawankar & Thakare, 2012b).
The performance of the system is tested for different categories of noise at various signal-to-noise ratio levels (Shrawankar & Thakare, 2013). Noise types include airport, car, exhibition, restaurant, station, street, train, factory, office, glass cabin, etc.
The paper is organized as follows: Section 2 describes the proposed methodology in detail, followed by the empirical process in Section 3 and results and discussion in Section 4. Concluding remarks are given in Section 5.

Proposed Methodology
This experimental work focuses on the following major issues for improving ubiquitous ASR performance and maintaining its consistency.
 Cleaning the speech signal: The first issue is speech signal filtering and enhancement for SNR improvement. Individual and hybrid methods are implemented at the back-end level, and system performance is tested with objective measures (an SNR improvement test) and subjective measures (a listening test).
The following modified filters and enhancement techniques are used; please refer to the cited papers for their mathematical formulation.
o Basic fundamental filters: low-pass, high-pass, band-pass, and band-stop.
 Feature extraction: Extracting features from speech signals using various parameter sizes (Zhu & Alwan, 2000). Five sets of features are extracted considering different frame sizes, window sizes, and frame overlaps.
 Training and testing using the adaptive method: To train the system for all categories of environment, the system uses speech samples from ten different categories of environment, recorded at various locations. With five sets of parameters and ten categories of noise, a total of 50 training sets are used.
 Optimizing the word recognition rate with appropriate variable parameter sizes using a fuzzy technique (Zadeh, 1965; Bezdek & Pal, 1992; Takagi & Sugeno, 1985): In this step a rule-based Fuzzy Inference System (FIS) is used. The SNR and word recognition accuracy of all fifty feature sets are sent to the FIS as input parameters, and the best window size, frame size, and frame overlap for that category of noise are computed as the output. Rules are framed to compute the output.

Empirical Process
Software was prepared for the simulation. The experiment is performed with the help of the following steps:
Step 1: Sample Collection (Recording): Recording is done outdoors at the different locations mentioned above. Recording specifications are given in Table 1 and Table 2.
Step 2: Speech Signal Analysis, Voiced/Unvoiced/Silent (VUS) Signal Identification: The presence of speech is detected by finding the beginning and end points of an utterance using a Voice Activity Detector (VAD) (Ramírez & Górriz, 2007). This two-point detection algorithm is based on measures of the signal, the zero-crossing rate and short-time energy, and checks whether a sample is voiced, unvoiced, or silent. Only voiced samples are considered; the remaining samples are discarded.
Step 3: Pre-Emphasis: In this pre-emphasis step, filters are implemented to estimate and reduce or filter out the noise.
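The voiced/unvoiced/silent decision in Step 2 can be sketched as follows. This is a minimal per-frame classifier using short-time energy and zero-crossing rate; the threshold values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def vus_classify(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Classify one frame as voiced, unvoiced, or silent using
    short-time energy and zero-crossing rate (hypothetical thresholds)."""
    energy = np.mean(frame ** 2)
    # zero-crossing rate: fraction of sign changes between adjacent samples
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    if energy < energy_thresh:
        return "silent"
    # voiced speech is high-energy / low-ZCR; unvoiced is high-ZCR
    return "voiced" if zcr < zcr_thresh else "unvoiced"

# Example: a 100 Hz sine behaves like voiced speech, white noise like unvoiced
t = np.linspace(0, 0.02, 320, endpoint=False)
voiced = 0.5 * np.sin(2 * np.pi * 100 * t)
noise = 0.2 * np.random.default_rng(0).standard_normal(320)
```

In a full VAD the same two measures would also be tracked over consecutive frames to locate the utterance's begin and end points.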
In order to illustrate the analysis of the filtering and enhancement techniques, fifty sets are considered.
The performance of the system is tested for all considered combinations of the techniques. The noisy signals were filtered and enhanced using four categories of techniques: traditional noise filters for additive background noise, adaptive filters for reducing reverberation, normalization techniques for convolutional noise, and speech signal enhancement algorithms for clearing distortion.
These filters and enhancement algorithms are implemented and tested for improving the intelligibility of the signal. The objective measure is checked by calculating the SNR and comparing it with the SNR before the filter is applied.

 Noise Filters
This category of filters is implemented to remove noise from speech signals corrupted by additive background noise.
In this experiment, four fundamental traditional FIR filters, high-pass, low-pass, band-pass, and band-stop, are implemented and tested. These filters are used for different frequency ranges: a high-pass filter for 20-22 Hz, a band-stop filter for 45-50 Hz, and a low-pass filter for 3-4 kHz. Considering the energy of the signal, the speech is separated from the noise.
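Such FIR designs can be sketched with SciPy as follows. The 16 kHz sampling rate and the 101-tap filter order are assumptions, not values from the text; note also that at this rate a narrow 45-50 Hz stop band would in practice need far more taps to be effective, so the demo exercises the low-pass design:

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 16000  # assumed sampling rate (Hz)

# 101-tap FIR designs mirroring the frequency ranges in the text
hp = firwin(101, 20, fs=fs, pass_zero=False)   # high-pass, 20 Hz cutoff
bs = firwin(101, [45, 50], fs=fs)              # band-stop, 45-50 Hz
lp = firwin(101, 3500, fs=fs)                  # low-pass, 3.5 kHz cutoff

# Demo: the low-pass removes a 6 kHz tone while keeping a 300 Hz tone
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 6000 * t)
y = lfilter(lp, 1.0, x)
Y = np.abs(np.fft.rfft(y))      # with a 1 s signal, bin k is k Hz
```

The separation of speech from noise by energy, mentioned above, would then be applied to the filtered signal.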

 Adaptive Filtering
Room reverberation is also one of the causes of speech signal distortion. Keeping this fact in mind, the system is tested using adaptive filters. These filters are implemented to improve the quality of speech signals distorted by acoustic echo or reverberation. The quality improvement is tested with the help of adaptive filter algorithms such as LMS, NLMS, ERLE, RLS, etc.
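Of these, the basic LMS update can be sketched as follows, in a minimal system-identification demo; the tap count `M`, step size `mu`, and the 3-tap echo path are illustrative assumptions:

```python
import numpy as np

def lms(x, d, M=8, mu=0.05):
    """Least-mean-squares adaptive filter: adapt M weights so that the
    filtered input x tracks the desired signal d. Returns the error signal
    (the 'cleaned' output in echo-cancellation setups) and the weights."""
    w = np.zeros(M)
    e = np.zeros(len(x))
    for n in range(M - 1, len(x)):
        xn = x[n - M + 1:n + 1][::-1]   # newest-first input window
        e[n] = d[n] - w @ xn            # estimation error
        w += mu * e[n] * xn             # stochastic-gradient weight update
    return e, w

# Identify a hypothetical 3-tap echo path from a white-noise input
rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
h = np.array([0.6, 0.3, 0.1])          # hypothetical echo impulse response
d = np.convolve(x, h)[:len(x)]
e, w = lms(x, d)                       # w[:3] converges towards h
```

NLMS differs only in normalizing the step size by the input window's energy, which makes convergence less sensitive to signal level.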

 Normalization
Speech samples of words can be recorded using a microphone, a telephone device, or any other recording instrument, so there is a possibility that the signals are corrupted by convolutional noise.
Normalization methods, a form of speech enhancement, help to remove the convolutional noise originating from mismatches in microphone and/or channel characteristics.
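The text does not name a specific normalization method, but a common choice for this purpose is cepstral mean normalization (CMN), sketched below as an illustrative example:

```python
import numpy as np

def cmn(cepstra):
    """Cepstral Mean Normalization: subtract the per-coefficient mean over
    all frames. A fixed channel multiplies the spectrum, so it appears as
    an additive constant in the cepstral domain; removing the mean removes it."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# 100 frames of 13 cepstral coefficients plus a constant channel offset
rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 13))
channel = rng.standard_normal(13)       # hypothetical channel bias
observed = clean + channel
# After CMN, the channel-corrupted features match CMN of the clean ones
normalized = cmn(observed)
```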
Step 4: Enhancement: Speech enhancement algorithms attempt to recover a clean speech signal from a degraded signal containing additive noise. The performance measures are evaluated using nine speech enhancement algorithms encompassing different classes: spectral subtractive, signal subspace, statistical-model-based (MMSE, log-MMSE, and log-MMSE under signal presence uncertainty), and Wiener-filtering-type algorithms (the a priori SNR estimation based method and the audible-noise suppression method) are considered and tested for performance.
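One of these classes, magnitude spectral subtraction, can be sketched as follows. This is a minimal frame-by-frame version without overlap-add; the over-subtraction factor `alpha`, spectral floor `beta`, and frame size are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame=256, alpha=2.0, beta=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame,
    keeping the noisy phase; floor the result at beta * |noisy spectrum|."""
    segs = [np.abs(np.fft.rfft(noise_only[i:i + frame]))
            for i in range(0, len(noise_only) - frame + 1, frame)]
    noise_mag = np.mean(segs, axis=0)              # average noise spectrum
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[i:i + frame])
        mag = np.maximum(np.abs(spec) - alpha * noise_mag,
                         beta * np.abs(spec))      # over-subtract, then floor
        out[i:i + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out

# Demo: a 500 Hz tone at 16 kHz buried in white noise
rng = np.random.default_rng(2)
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 500 * t)
noisy = clean + 0.3 * rng.standard_normal(fs)
enhanced = spectral_subtraction(noisy, 0.3 * rng.standard_normal(4096))
```

A production version would add windowed overlap-add synthesis to avoid frame-boundary artifacts.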
Step 5: Performance Evaluation: Multiple methods are implemented independently as well as in combination (hybrid) to check the performance of the system. Performance evaluation is based on two measures: the first is an objective evaluation using an SNR improvement test, and the second is a subjective quality evaluation using an informal listening test together with spectrogram and waveform observation.

 Objective analysis (SNR improvement test)
The Signal-to-Noise Ratio (SNR) improvement test is used as the objective measure (Ma et al., 2011; Ma et al., 2009; Hu et al., 2008): the SNR is calculated and compared with the SNR before the filter is applied, the spectrogram as well as the waveform is plotted, and the clarity after applying the filter or enhancement algorithm is observed.
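The SNR improvement measure reduces to a simple difference of output and input SNR when a clean reference is available, as in this sketch:

```python
import numpy as np

def snr_db(clean, signal):
    """SNR of `signal` relative to the clean reference, in dB."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((signal - clean) ** 2))

def snr_improvement(clean, noisy, enhanced):
    """Output SNR minus input SNR: the objective measure used here."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)

# If enhancement cuts the residual noise amplitude from 0.5 to 0.1,
# the improvement is 20*log10(5), about 14 dB
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
noise = rng.standard_normal(8000)
improvement = snr_improvement(clean, clean + 0.5 * noise, clean + 0.1 * noise)
```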

 Subjective analysis (Listening Test)
The subjective quality evaluation is done using a listening test (Hu et al., 2006) performed by normal-hearing persons. Informal listening tests are conducted for this qualitative evaluation. Ten volunteers were requested to evaluate the performance of the speech enhancement methods implemented in this project, giving their decisions on an individual basis. Ten speech samples were considered, one isolated-word (digit) sample each, for every listener. First, all the samples were numbered and played in the same order in which they were enhanced. The listeners then ranked the methods based on the intelligibility and quality of the enhanced speech, and their observations were noted down.
Step 6: Signal Enhancement using Hybrid Methods: As one of the aims of this work is to remove all categories of noise and distortion (additive noise, convolutional noise, reverberation, etc.) from the speech signal, hybrid methods are constructed. In a hybrid method, each enhancement method is implemented in combination with the adaptive filters and normalization methods. Again, the performance is observed using the objective (SNR improvement test) and subjective (informal listening test) measures, and is tested for the proper combination of all categories of algorithms.
Step 7: Feature Extraction: The next important task is feature extraction. The signal is windowed with a specific window function (Hamming) of a given length, and the word is partitioned into small units called frames. The frame size is varied from 10 ms to 50 ms, with 30-40% overlap. Feature extraction is executed for each frame independently. The spectrum is calculated for each window using the FFT and then filtered with a Mel-scaled filter bank to obtain the corresponding Mel coefficients. The logarithm of the Mel coefficients is computed, and the discrete cosine transform is used to transform into the cepstral space. Unnecessary (high-frequency) MFCC coefficients are discarded, and finally 20 MFCC coefficients are kept. The extracted feature matrix (20 x 20) is sent to train the model.
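The MFCC pipeline of Step 7 can be sketched for a single frame as follows; the 16 kHz sampling rate and 26-band filter bank are illustrative assumptions, while the 20 retained coefficients follow the text:

```python
import numpy as np

def mfcc_frame(frame, fs=16000, n_mels=26, n_ceps=20):
    """MFCC for one frame, following the steps in the text:
    Hamming window -> FFT -> Mel filter bank -> log -> DCT,
    keeping the first n_ceps coefficients."""
    nfft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(nfft))) ** 2

    # Triangular Mel filter bank between 0 Hz and fs/2
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(0, mel(fs / 2), n_mels + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    log_mel = np.log(fb @ spec + 1e-10)           # log Mel energies
    # DCT-II to the cepstral domain; higher coefficients are discarded
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_mels)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_mels) @ log_mel

# One 25 ms frame (400 samples at 16 kHz) of a 440 Hz tone
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
ceps = mfcc_frame(frame)                          # 20 coefficients
```

Stacking these vectors over 20 frames yields a 20 x 20 feature matrix like the one sent to training.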
Step 8: Training & Decoding: The feature vectors obtained from the MFCC stage are used to train the model. Training is given using all types of samples: clean, with artificial noise added, with real-world environmental noise, and enhanced. The training method is based on Hidden Markov Models (HMMs), and the system uses a Bakis model. The training procedure is completed iteratively.
For the first iteration, random or equal (the latter is the default) numbers of frames are assigned to each state. The system uses a number of inputs equal to the number of coefficients extracted from a frame, and a number of outputs equal to the number of states of the model. The system is trained so that coefficient vectors corresponding to each state activate the corresponding output. After training, the outputs can be interpreted as the values of the emission PDFs.

 Decoding
The next phase is decoding. The Viterbi algorithm is applied, and the best path, the path with the highest probability, is obtained efficiently. The probability is computed from both the emission and transition probabilities of the model; its value represents the probability that the model (with its current parameters) corresponds to the observations. During training, the model is adjusted so that this probability increases. Considering the best path, the correspondence between each frame and each state is modified. The first consequence is the modification of the transition probabilities; the second is the modification of the input vectors. The next iteration then begins with the new probability values.
All categories of noisy samples are considered for the training.
Step 9: Recognition: After the system is trained, actual recognition begins: given an unknown observation, determine which model most probably generated it. Front-end analysis is applied and the coefficients are extracted. Then the probability of correspondence between each model and the observation is computed using the Viterbi algorithm, and the model with the highest probability of compatibility is recognized.
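The Viterbi decoding used in Steps 8 and 9 can be sketched in log space as follows, shown here on a toy 3-state Bakis (left-to-right) model with hypothetical transition and emission values:

```python
import numpy as np

def viterbi_log(log_A, log_B, log_pi):
    """Viterbi decoding in log space: the most probable state path and its
    log probability. log_A: (S, S) transition log-probs, log_B: (T, S)
    per-frame emission log-likelihoods, log_pi: (S,) initial log-probs."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]               # best score ending in each state
    psi = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[i, j]: from state i to j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):          # backtrack along best choices
        path[t] = psi[t + 1][path[t + 1]]
    return path, float(delta.max())

# Bakis model: each state either loops or moves one state to the right
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
log_A = np.log(np.maximum(A, 1e-12))
log_pi = np.log(np.maximum(np.array([1.0, 0.0, 0.0]), 1e-12))
# Six frames whose emissions strongly favour states 0,0,1,1,2,2
log_B = np.full((6, 3), -10.0)
log_B[np.arange(6), [0, 0, 1, 1, 2, 2]] = 0.0
path, logp = viterbi_log(log_A, log_B, log_pi)
```

For recognition, the same routine is run once per word model and the model with the highest resulting log probability wins.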
Word recognition accuracy is calculated and tested for unknown as well as trained samples with 20% overlap.
Step 10: Finding the Best (Feasible and Optimized) Solution using a Fuzzy Inference System (FIS)
While evaluating the performance of the speech processing methods, it is observed that every method behaves differently as parameters change: the Hamming window size, the frame size and overlap size, the filter used, the enhancement algorithm implemented, the category and type of noise, etc. Since it is very difficult to predict the category of noise and select the proper variable sizes and algorithm for a real-world noisy environment, it is desirable to obtain the best, or optimized, solution for these variabilities.
The best-variable-size module uses a rule-based Fuzzy Inference System: the FIS is designed and the best accuracy is computed.
The fuzzy approach is implemented with the help of the five parts of the fuzzy inference process: • fuzzification of the input variables, • application of the fuzzy operator in the antecedent, • implication from the antecedent to the consequent, • aggregation of the consequents across the rules, and • defuzzification. Three inputs are used in the system: the SNR value is passed as Environment, the Hamming window size as WinSz, and the frame overlap percentage as FrOver.
The input parameters, their membership functions, and their ranges are as follows. [Input1] Environment is defined based on the SNR value: 10-20 dB is Very Noisy, 20-35 dB is Noisy, and 35-50 dB is assumed to be a clean environment.
After defining the inputs, outputs, and their membership functions, rules are framed and weights are assigned as given below:
1. If (Environment is Clean) then (Accuracy is Better) (0.5)
2. If (Environment is Clean) and (FrOver is Medium) then (Accuracy is Best) (0.75)

3. If (WinSz is Medium) then (Accuracy is Better) (0.5)
The final step is defuzzification: the output accuracy is observed for the different rules, and a crisp value is obtained using the centroid defuzzification method.
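The whole inference chain can be sketched as a minimal Mamdani-style FIS over the rules quoted above. The SNR bands follow the paper; the WinSz, FrOver, and Accuracy membership ranges are illustrative assumptions:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-12),
                                 (c - x) / (c - b + 1e-12)), 0.0)

def fis_accuracy(snr_db, win_sz_ms, fr_over_pct):
    """Mamdani-style sketch of the three rules in the text; the output is
    a crisp accuracy obtained by centroid defuzzification."""
    # Input memberships
    clean = tri(snr_db, 35, 50, 65)           # 35-50 dB ~ Clean (paper)
    med_win = tri(win_sz_ms, 150, 250, 350)   # "Medium" WinSz (assumed)
    med_ov = tri(fr_over_pct, 30, 45, 60)     # "Medium" FrOver (assumed)

    # Output fuzzy sets over an accuracy universe of 0-100 (assumed)
    acc = np.linspace(0, 100, 501)
    better = tri(acc, 50, 70, 90)
    best = tri(acc, 70, 90, 110)

    # Weighted rules, implication by min, aggregation by max:
    # 1. Clean -> Better (0.5); 2. Clean and FrOver Medium -> Best (0.75);
    # 3. WinSz Medium -> Better (0.5)
    agg = np.maximum.reduce([
        np.minimum(0.5 * clean, better),
        np.minimum(0.75 * min(clean, med_ov), best),
        np.minimum(0.5 * med_win, better),
    ])
    # Centroid defuzzification
    return float(np.sum(acc * agg) / (np.sum(agg) + 1e-12))
```

A favourable operating point (clean environment, medium window and overlap) should then score higher than a noisy one with mismatched parameters.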

Results and Discussion
First, the performance of the different filters and enhancement algorithms is analysed with the help of the SNR improvement test and the listening test; the results are shown in Table 3, and a comparative study is shown in Figure 1. The results show that spectral subtraction and Wiener filters are suitable techniques for removing mixed noise. It is further observed that the hybrid methods (combinations of adaptive filters, reverberation filters, and an enhancement method) improve the SNR value (shown in Table 4 and Figure 2) and help in achieving better accuracy. Results for one sample are given in Table 5, and the accuracy analysis is shown in Figure 3. The same experiment is performed for different samples collected from different locations. Finally, these values are sent to the fuzzy rule-based system, and the optimized variable sizes are computed.

Conclusions
An adaptive methodology is essential for improving the performance of a ubiquitous ASR system, since adverse environmental effects are not constant.
Adaptation is achieved by multi-environment training with all probable combinations of variable window and frame sizes, etc., while extracting features.
It is observed that a Hamming window size of 245-250 ms and a frame overlap of 45% give the best accuracy for the ubiquitous ASR system.
As the speech signal is cleaned with all possible hybrid methods, adverse environmental effects are normalized and hence environmental robustness is achieved.