Computer Aided Recognition of Vocal Folds Disorders by Means of RASTA-PLP

In the context of the recognition of vocal folds disorders, systems based on acoustic analysis are being introduced as computer aided medical diagnosis tools due to their objectivity and noninvasive nature. Acoustic analysis is a complementary tool to methods based on direct observation of the vocal folds by laryngoscopy; it can also be used for the evaluation of surgical operations. This paper presents a novel approach to voice pathology assessment using the RASTA-PLP feature extraction method in the framework of an HMM. The proposed method is then compared to other feature extraction methods such as MFCC and PLP. The experimental results show that RASTA-PLP attained a correct classification rate of 92.86% and an AUC of 0.94, compared to AUCs of 0.81 and 0.79 for MFCC and PLP respectively.


Introduction
Laryngeal pathology has received much attention in recent years because the modern way of life has increased the number of professionals whose work depends heavily on the use of their voice, such as teachers, TV presenters, and singers; unhealthy social habits such as smoking and excessive alcohol consumption may also cause voice disorders. People are also exposed to the risk of voice problems caused by errors in surgical operations such as laser cordectomy or parathyroidectomy. Acoustic analysis has proved to be an excellent tool for voice disorder detection and assessment. Voice assessment techniques may be divided into two categories: subjective and objective techniques. Ear, Nose and Throat doctors use a subjective technique that relies on the doctor listening to the patient's voice, which may lead to errors. The objective techniques are based on physical measurements obtained during phonation. They include measures of vocal fold vibratory movement, such as laryngoscopy, glottography, digital stroboscopy, electromyography and videoendoscopy (Kukharchik, Martynov, Kheidorov & Kotov, 2007). These techniques are more accurate in diagnosing various laryngeal diseases because they capture the movements of the vocal folds. However, they are invasive, require costly resources, and require experienced professionals. They may also cause considerable discomfort and sometimes provoke resistance from patients during examination, which may distort the data and thus produce false diagnoses (Adnene, Lamia & Mounir, 2003) and (Alonso, J., Leon, Alonso, I., & Ferrer, 2001).
In this paper, a novel approach to recognize the presence of pathology from voice records is proposed and discussed by means of short-time parameterization of the speech signal. The automatic recognition of voice alterations is addressed by means of Hidden Markov Models (HMM) and Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP) complemented with short-term energy measurements. The proposed method is compared to other well-known feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP).

Related Work
Over recent years, several studies have been carried out on the automatic recognition of vocal fold pathologies by means of acoustic analysis. These works can be categorized into two groups. The first group (A) concentrated on finding the most important parameters to estimate voice quality, while the second group (B) concentrated on finding the best classifier to detect vocal fold pathology.
In group (A), most of the long-term voice parameters that are extracted from pitch data (Benesty, Sondhi & Huang, 2008) can be divided into four categories: fundamental frequency, amplitude perturbation, frequency perturbation and noise parameters.
In (Kasuya, Ogawa, Mashima & Ebihara, 1986), the authors proposed the normalized noise energy (NNE) parameter for acoustic discrimination of voice disorders and obtained an accuracy of 78.6% for NNE and 74.1% for the harmonics-to-noise ratio (HNR). In (Godino-Llorente, Ruiz, Lechon & Gomez-Vilda, 2008), the authors evaluated the capabilities of the glottal-to-noise excitation (GNE) ratio for the screening of voice disorders, reporting an accuracy of 95%. The authors in (Yumoto, Gould & Baer, 1982) proposed the HNR parameter for acoustic discrimination of voice disorders, reporting an error rate of 16.7%. In (Godino-Llorente, Ruiz & Gomez-Vilda, 2009), the authors proposed a new parameter that correlates with the perceived hoarseness, giving an indication of the degree of normality. The proposed index, named the Pathological Likelihood Index (PLI), reached a reported accuracy of 95% in the screening of voice disorders.
Other works indicate that accurate screening can be carried out by using a combination of several of the aforementioned acoustic parameters. One such approach is found in (Hadjitodorov & Mitev, 2002), where the authors use several parameters together with a new parameter called turbulent noise estimation to detect pathological voices; the system reached an accuracy of 96.1% using a k-nearest neighbor (k-NN) classifier.
Regarding group (B), the pattern recognition methods used for the automatic detection of vocal folds pathologies range from simple classifiers such as k-NN or linear discriminant analysis (LDA) to more complex techniques such as Gaussian mixture models (GMM), hidden Markov models (HMM), support vector machines (SVM) and artificial neural networks (ANN); other approaches use hybrid classifiers.
In (Ananthaknshna, Shama & Niranjan, 2004), the authors used a simple k-NN classifier for voice pathology detection, yielding a classification accuracy of 89.19%. In (Shama, Krishna & Cholayya, 2007), a modification of the standard k-NN classifier was proposed to classify a set of 53 normal and 163 pathological speakers extracted from the MEEI database. The best accuracy obtained was 94.28%, using HNR. In (Hariharan, Paulraj, & Yaacob, 2009), simple k-NN and LDA based classifiers were used to test the effectiveness of the mel-frequency band energy coefficients (MFBECs) combined with a singular value decomposition (SVD) based feature vector. The experiments were performed on a subset of the MEEI database with 53 normal and 657 pathological speakers, yielding a classification accuracy of 99.59% for the k-NN classifier and 98.48% for the LDA classifier.
In (Godino-Llorente, Gomez-Vilda & Velasco, 2006) and (Godino-Llorente, Aguilera-Navarro & Gomez-Vilda, 2001), a probabilistic GMM model was used for classification between normal and pathological voices. In (Godino-Llorente, Gomez-Vilda & Velasco, 2006), the features used to train the classifier were Mel-Frequency Cepstral Coefficients (MFCC) along with their first derivative, obtaining an efficiency of around 94% with 53 normal and 173 pathological speakers from the MEEI database. In (Godino-Llorente, Aguilera-Navarro & Gomez-Vilda, 2001), the features used to train the classifier were MFCC and energy along with their first and second derivatives, obtaining an efficiency of around 94% with 53 normal and 82 pathological speakers from the MEEI database.
In (Dibazar, Narayanan & Berger, 2002), more complex probabilistic models, such as HMM, were used for voice pathology detection, with reported accuracies ranging from 97.75% to 98.3%. The features used in these cases were MFCC, their velocity and acceleration parameters, as well as different acoustic and noise measures.
In (Godino-Llorente, Gomez-Vilda & Velasco, 2005), a discriminative classifier, the SVM, was used to identify laryngeal pathologies. MFCC and noise features were used, yielding a classification accuracy of up to 95%. The study proposed in (Saenz-Lechon, Osma-Ruiz, Godino-Llorente, Blanco-Velasco, Cruz-Roldan, & Arias-Londono, 2008) considers a subset of the Kay database comprising 53 normal and 173 pathological sustained vowels. The authors investigate the performance of an automatic system for voice pathology detection when the voice samples have been compressed in MP3 format at different bit rates (160, 96, 64, 48, 24, and 8 kb/s). The feature set comprised MFCCs, HNR, NNE, GNE and energy, as well as their respective first derivatives. The classification was performed using GMM and SVM classifiers. The best accuracy was 94.35% for GMM and 93.01% for SVM. The authors highlighted that there are no significant differences in the performance of the detector when the bit rate of the compressed data is above 64 kb/s.
In (Marinus, Fechine, Gomes & Costa, 2009), a multilayer perceptron (MLP) was used to discriminate among normal voices, voices affected by vocal fold edema, and voices affected by other pathologies (nodules, cysts and paralysis). The experiments were performed on a subset of the MEEI database with 44 pathological speakers with edema, 23 with other pathologies such as nodules, cysts and paralysis in the vocal folds, and 53 normal speakers. The feature extraction based on cepstral coefficients yielded a correct classification rate above 99% for normal voices, 96% for edema and 93% for other pathologies. In (Salhi, Talbi & Cherif, 2008), the authors proposed a technique that uses wavelet analysis to extract a feature vector from speech samples, which is used as the input to an MLP classifier, yielding a best accuracy of 90% with 50 normal and 50 pathological speakers from a private database.
The study in (Wang, Zhang & Yan, 2011) uses a hybrid of the aforementioned classifiers. A GMM-SVM classifier is proposed and trained on MFCC features from the MEEI database, yielding a classification accuracy of up to 96.1%.

Methodology
This paper proposes a system for the discrimination between normal and pathological voices based on an HMM classifier. The method employed is based on the Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP) feature extraction technique. It is then compared to other feature extraction methods such as MFCC and PLP. Figure 1 depicts a block diagram of the different steps carried out in the process set up for the recognition of voice alterations. A short description of each step is presented in the following sections.

Signal Pre-processing
Before the digital speech signal can be used for feature extraction, a process called pre-emphasis is applied to emphasize the high-frequency portion of the spectrum. Pre-emphasis is accomplished by passing the signal through a high-pass filter whose transfer function is given by (Rabiner & Huang, 1993):
$$H(z) = 1 - a z^{-1}, \quad 0.9 \le a \le 1 \tag{1}$$
Because boosting the high-frequency energy gives more information to the acoustic model, the value of the pre-emphasis parameter a was determined adaptively to be 0.97. Figure 2 illustrates the time representation of normal and pathological speech signals before and after the pre-emphasis step.
The speech data are then divided into overlapping frames of 20 milliseconds length with a frame shift of 10 milliseconds, and each frame is multiplied by a Hamming window.
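The pre-processing steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the usual time-domain form of the pre-emphasis filter, y[n] = x[n] - a·x[n-1], a sampling rate of exactly 22000 Hz, and the standard Hamming window definition. The function names are hypothetical.

```python
import math

def pre_emphasize(signal, a=0.97):
    """Apply y[n] = x[n] - a * x[n-1] (assumed time-domain form of the filter)."""
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - a * signal[n - 1])
    return out

def hamming(length):
    """Standard Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(signal, sample_rate, frame_ms=20, shift_ms=10):
    """Cut the signal into overlapping frames and window each one."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames

# Example: one second of a synthetic 100 Hz tone (not real voice data).
sample_rate = 22000  # assumed exact value for the "22 kHz" recordings
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(pre_emphasize(tone), sample_rate)
```

With these settings each frame holds 440 samples and consecutive frames overlap by half their length.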

Feature Extraction
Feature extraction aims at giving a useful representation of the speech signal by capturing the important information from it. Feature extraction approaches are commonly divided into production-based and perception-based methods. LPC is an example of the first group, while MFCC, PLP, and RASTA-PLP belong to the perception-based family. Since we want to simulate an experienced speech therapist who can detect the presence of a disorder just by listening to the voice, we focus on the perception-based group.

Mel-Frequency Cepstral Coefficients (MFCC)
MFCCs have been calculated following a non-parametric modeling method, which originates from knowledge of the human auditory perception system. These coefficients are computed for each speech frame by weighting the magnitude spectrum with a mel filterbank. The term mel refers to a unit of measurement related to perceived frequency. The mapping between the real frequency scale (Hz) and the perceived frequency scale (mels) is approximately linear below 1 kHz and logarithmic at higher frequencies (Feijoo & Hernandez, 1990). The suggested formula that models this relationship is as follows (Deller, Proakis & Hansen, 1993):
$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{2}$$
where f is the real frequency in Hz. The log of each filter output is then computed, and finally the Discrete Cosine Transform (DCT) of the log-mel-spectrum is taken. The MFCCs are the resulting coefficients of this DCT operation.
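The last two MFCC steps can be illustrated with a short sketch. It assumes the commonly used mel mapping f_mel = 2595·log10(1 + f/700) and a plain, unnormalized DCT-II; the filterbank itself is not shown, and the function names are hypothetical rather than taken from the paper or from HTK.

```python
import math

def hz_to_mel(f_hz):
    """Assumed mel mapping: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel, useful for placing filterbank centre frequencies."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def dct_of_log_energies(filter_energies, num_coeffs):
    """Unnormalized DCT-II of the log filterbank outputs (cepstral coefficients)."""
    log_e = [math.log(e) for e in filter_energies]
    m = len(log_e)
    return [sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / m) for j in range(m))
            for i in range(num_coeffs)]

# Example: a flat log-spectrum puts all energy in coefficient 0.
flat = [math.e] * 8
cepstrum = dct_of_log_energies(flat, 4)
```

A flat filterbank output gives zero for every coefficient except the first, which is why the higher coefficients are read as describing the shape of the spectral envelope.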

Perceptual Linear Prediction (PLP)
The PLP feature extraction is similar to LPC analysis. It is based on the short-term spectrum of speech. In contrast to pure linear predictive analysis of speech, PLP modifies the short-term spectrum of the speech by several psychophysically based transformations in order to mimic the human auditory system. In practice, PLP can give small improvements over MFCCs, especially in noisy environments, and hence it is the preferred encoding for many systems.
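Two of the psychophysically based transformations can be sketched briefly. The paper does not spell them out, so this sketch assumes the usual PLP formulation: Bark-scale frequency warping and cube-root intensity-to-loudness compression. Equal-loudness pre-emphasis and the final all-pole modeling are omitted, and the function names are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Bark warping in its usual PLP form: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)

def intensity_to_loudness(power):
    """Cube-root compression approximating the power law of hearing."""
    return power ** (1.0 / 3.0)

# Example: warped positions of a few frequencies and a compressed power value.
bark_points = [hz_to_bark(f) for f in (0.0, 1000.0, 2000.0, 4000.0)]
loudness = intensity_to_loudness(8.0)
```

The warping compresses high frequencies, so equal steps on the Bark axis cover progressively wider bands in Hz.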

Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP)
The RASTA approach (Hermansky & Morgan, 1994) is based on band-pass time-filtering applied to a log-spectral representation of the speech, as shown in Figure 3, in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel. The PLP technique (like most other short-term spectrum based techniques) is vulnerable when the short-term spectral values are modified by the frequency response of the communication channel. The RASTA methodology therefore makes PLP more robust to linear spectral distortions and yields better results than PLP for speech recognition tasks in noisy environments.
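The band-pass time-filtering can be illustrated on the trajectory of a single log-spectral band. This sketch assumes the commonly cited causal form of the RASTA filter, with numerator 0.1·(2 + z^-1 - z^-3 - 2z^-4) and a single pole at 0.98; some implementations use a different pole value, so both the coefficients and the function name should be treated as assumptions.

```python
def rasta_filter(trajectory, pole=0.98):
    """Filter one log-spectral band over time with the assumed RASTA filter."""
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]  # sums to zero, so constant offsets are removed
    out = []
    for n in range(len(trajectory)):
        # FIR part: weighted sum of the current and four previous frames
        fir = sum(numer[k] * trajectory[n - k]
                  for k in range(len(numer)) if n - k >= 0)
        # IIR part: leaky integration of the previous output
        prev = out[n - 1] if n > 0 else 0.0
        out.append(fir + pole * prev)
    return out

# Example: a constant channel offset in the log domain decays towards zero.
offset_only = [5.0] * 1000
filtered = rasta_filter(offset_only)
```

Because a fixed channel response becomes an additive constant in the log-spectral domain, suppressing the constant component of each band's trajectory is what removes the static spectral coloration.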

Temporal Derivatives
An improved representation can be obtained by extending the analysis to include information about the temporal derivatives (speed and acceleration) of the parameters. This is especially important in the present case because it provides information about the short-term variability, which is higher under pathological conditions (Childers & Sung-Bae, 1992).
To introduce temporal order into the parameter representation, we denote the mth coefficient at time t by c_m(t) and approximate its time derivative as (Rabiner & Huang, 1993):
$$\frac{\partial c_m(t)}{\partial t} \approx \Delta c_m(t) = \mu \sum_{k=-K}^{K} k \, c_m(t+k) \tag{3}$$
where µ is an appropriate normalization constant and (2K + 1) is the number of frames over which the computation of the derivative is performed. For each frame at time t, the result of the analysis is a vector of L coefficients, to which two L-size vectors corresponding to the first and second time derivatives are appended as follows:
$$O(t) = [c_1(t), \ldots, c_L(t), \Delta c_1(t), \ldots, \Delta c_L(t), \Delta^2 c_1(t), \ldots, \Delta^2 c_L(t)] \tag{4}$$
where O(t) is a feature vector with 3·L elements.
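The derivative computation and the stacking into a 3·L vector can be sketched as below. Three choices here are assumptions, not stated in the paper: the normalization µ = 1/Σk², a window of K = 2 frames, and replication of the first and last frames at the edges. The function names are hypothetical.

```python
def delta(coeffs, K=2):
    """Regression-based time derivative of each coefficient track.

    coeffs is a list of frames, each a list of L coefficients.
    """
    T = len(coeffs)
    mu = 1.0 / sum(k * k for k in range(-K, K + 1))  # assumed normalization
    out = []
    for t in range(T):
        frame = []
        for m in range(len(coeffs[t])):
            acc = 0.0
            for k in range(-K, K + 1):
                idx = min(max(t + k, 0), T - 1)  # replicate edge frames
                acc += k * coeffs[idx][m]
            frame.append(mu * acc)
        out.append(frame)
    return out

def stack_features(coeffs, K=2):
    """Append first and second derivatives to each frame (3 * L values)."""
    d1 = delta(coeffs, K)
    d2 = delta(d1, K)
    return [c + a + b for c, a, b in zip(coeffs, d1, d2)]

# Example: a coefficient rising by 2 per frame has a derivative of 2.
ramp = [[2.0 * t, 1.0] for t in range(10)]
d = delta(ramp)
observations = stack_features(ramp)
```

With this normalization a linearly changing coefficient returns its slope exactly at interior frames, which makes the derivative values directly interpretable.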

Classification
The technique used for the classification stage was the HMM. It is well known that HMMs are stochastic models that allow the representation of time series. The use of hidden states makes the model generic enough to handle a variety of complex real-world time series.
The proposed system uses the Hidden Markov Model Toolkit (HTK Version 3.4). It was modified to accommodate the RASTA-PLP features, as shown in Figure 3. In addition, left-to-right HMMs with 3 states and 1 mixture were formed.
The Expectation-Maximization (EM) algorithm was used to train the HMM, and a series of experiments was carried out with this HMM topology. In all of the experiments of this study, five training iterations were enough for good convergence of the model likelihoods.
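The decision step with such a topology can be illustrated with a small sketch. It is not HTK code: it assumes one trained model per class, a single diagonal-covariance Gaussian per state, a model that always starts in its first state, and an arbitrary self-transition probability of 0.6. EM training is omitted, and all names and parameter values are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (one mixture per state)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(values):
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))

def left_to_right_transitions(stay=0.6):
    """3-state matrix: each state may only stay or move one step to the right."""
    return [[stay, 1 - stay, 0.0],
            [0.0, stay, 1 - stay],
            [0.0, 0.0, 1.0]]

def log_likelihood(frames, means, variances, trans):
    """Forward algorithm in the log domain."""
    n = len(means)
    log_a = [[math.log(p) if p > 0 else NEG_INF for p in row] for row in trans]
    alpha = [log_gauss(frames[0], means[0], variances[0])] + [NEG_INF] * (n - 1)
    for x in frames[1:]:
        alpha = [logsumexp([alpha[i] + log_a[i][j] for i in range(n)])
                 + log_gauss(x, means[j], variances[j]) for j in range(n)]
    return logsumexp(alpha)

def classify(frames, models):
    """models maps a label to (means, variances, trans); returns the best label."""
    return max(models, key=lambda label: log_likelihood(frames, *models[label]))

# Toy one-dimensional models with made-up parameters (not trained on real data).
models = {
    "normal": ([[0.0]] * 3, [[1.0]] * 3, left_to_right_transitions()),
    "pathological": ([[3.0]] * 3, [[1.0]] * 3, left_to_right_transitions()),
}
label = classify([[0.1], [-0.2], [0.05]], models)
```

A recording is assigned to whichever class model gives the higher likelihood for its sequence of feature vectors.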

Data Collection
The voice data were collected in a soundproof room of the Phoniatrics department of Kobri Elkobba Hospital. The acoustic samples correspond to sustained phonations (1-3 s long) of the vowel /ah/ from patients (males and females) with normal voices and with a wide variety of vocal folds disorders such as cysts, polyps, nodules, paralysis, edemas and carcinoma. Table 1 shows the database of vocal fold diseases. The files were recorded with a low noise level, a constant microphone distance of around 15 cm from the talker's lips, and a 22 kHz sampling rate, then quantized at a resolution of 16 bits per sample. We carried out our experiments on 35 voices.
The HMM classifier was trained with 60% of the available speech records; the remaining 40% of the records were used for testing.
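The 60/40 partition can be sketched as follows. The paper does not say how records were assigned to each set, so the seeded random shuffle used here is an assumption, and the record identifiers are placeholders.

```python
import random

def split_records(records, train_fraction=0.6, seed=0):
    """Shuffle reproducibly, then cut into training and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder identifiers standing in for the 35 recorded voices.
train, test = split_records(range(35))
```

With 35 records this gives 21 for training and 14 for testing. One misclassified record out of 14 corresponds to 13/14 ≈ 92.86%, which matches the reported rate, although the paper does not state the counts explicitly.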

Performance Evaluation
In order to evaluate the performance of the detector and to enable comparisons to be made, several measurements (TP, TN, FP, and FN) and ratios (SE, SP, E, and AUC) were taken into account.
1) True positive (TP): The detector found an event (pathological voice) when one was present.
2) True negative (TN): The detector found no event (normal voice) when indeed none was present.
3) False positive (FP): The detector found an event when none was present.
4) False negative (FN): The classifier missed an event.
5) Sensitivity (SE): Likelihood that an event will be detected given that it is present.
$$SE = 100 \cdot \frac{TP}{TP + FN} \tag{5}$$
6) Specificity (SP): Likelihood that the absence of an event will be detected given that it is absent.
$$SP = 100 \cdot \frac{TN}{TN + FP} \tag{6}$$
7) Efficiency (E): Likelihood that the decision of the detector is correct.
$$E = 100 \cdot \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
8) Area under curve (AUC): The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1.0.
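The ratios listed above can be computed as in the following sketch. The counts and scores in the example are invented for illustration and are not the results of this study; the AUC is computed directly from its ranking definition, counting ties as one half, which is an assumption about tie handling.

```python
def detection_ratios(tp, tn, fp, fn):
    """Return sensitivity, specificity and efficiency as percentages."""
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    e = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return se, sp, e

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Invented counts and detector scores, for illustration only.
se, sp, e = detection_ratios(tp=40, tn=45, fp=5, fn=10)
area = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

In the example, five of the six positive-negative score pairs are ordered correctly, so the AUC is 5/6.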

Results
Table 2 presents the results for three independent feature extraction techniques, MFCC, PLP and RASTA-PLP, obtained on our private database. With respect to accuracy, RASTA-PLP parameters complemented with their first derivative are the best solution for our purpose: the accuracy reached 92.86% and the AUC equals 0.94, while the AUC of MFCC and PLP equals 0.81 and 0.79 respectively.
From the results in Table 2, it is possible to infer that the recognition system behaves better when it is trained with RASTA-PLP features than with MFCC or PLP features, and that the recognition accuracy decreases as the dimension of the feature vector increases.

Discussion and Conclusions
The proposed scheme may be used for laryngeal pathology recognition. RASTA-PLP, PLP and MFCC feature extraction methods were used, and the features were tested with a Hidden Markov Model (HMM) classifier. Short-term RASTA-PLP complemented with the first derivative proved to be a good parameterization approach for the recognition of voice diseases. We can conclude that adding the second derivatives does not have a relevant influence on the results.
Nevertheless, a wider database of pathological voices is needed, which is not easy to collect.

Future Work
Since recognizing the presence of a voice disorder appears to be feasible, future work will aim to identify the type of pathology. For this purpose, the system should pass through two main steps: the first deals with the recognition of a voice disorder; once its presence is confirmed, the second step identifies the type of voice disorder.
Other authors have used artificial neural networks (ANN) to differentiate between different levels of pathology according to a perceptual voice quality scale. In (Fraile, Saenz-Lechon, Godino-Llorente, Osma-Ruiz & Fredouille, 2009), the patients were split and differentiated by sex. The features used to train the ANN were based on MFCC, yielding a classification accuracy of 88.3% with 53 normal and 173 pathological speakers from the MEEI database.

Figure 1. Block diagram of the computer aided recognition of vocal folds disorders