An Investigation into Methodology and Metrics Employed to Evaluate the (Speech-to-Speech) Way in Translation Systems

Speech-to-speech translation is a challenging problem because of the poor sentence planning typically associated with spontaneous speech and the errors introduced by automatic speech recognition. Based on a statistically trained speech translation system, this study investigates the methodologies and metrics employed to assess speech-to-speech translation systems. Speech translation is performed incrementally, based on the generation of partial hypotheses from speech recognition. Speech-input translation can be approached as a pattern recognition problem by means of statistical alignment models and stochastic finite-state transducers. Under this general framework, some specific models are presented. One feature of such models is their ability to learn automatically from training examples. A speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed.
Keywords: methodology, speech-to-speech, translation systems


Introduction
A Speech-to-Speech Translation (SST) system is composed of an Automatic Speech Recognizer (ASR) chained to a Spoken Language Translation (SLT) module and to a Text-To-Speech (TTS) component in order to produce the speech in the target language (Hamon & Mostefa, 2008). Speech-to-speech translation is a challenging problem, due to the poor sentence planning typically associated with spontaneous speech, as well as errors caused by automatic speech recognition. Most speech translation systems reported in the literature operate within more or less restricted domains (Levin et al., 2000; Frederking et al., 2002; Gao et al., 2002; Rayner and Bouillon, 2002). Many are based on the Interlingua approach to translation; however, systems differ in their linguistic complexity. Knowledge-lean statistical machine translation approaches are nearly universally embraced for the task of unrestricted text translation (Koehn et al., 2003), perhaps because it is more difficult to exploit knowledge effectively in a broad domain. In restricted domains, rule-based and statistical approaches clearly show different strengths and weaknesses, which make them complement each other nicely (Wang & Seneff, 2004).
Moreover, the translation module of a speech translation system, a natural offspring of text-input translation systems, usually takes a single-best recognition hypothesis transcribed as text and performs standard text-based translation. Much of the supplementary information available from speech recognition, such as N-best recognition hypotheses and the likelihoods of the acoustic and language models, is not well utilized in the translation process. This information can improve translation quality if employed properly. It can be exploited either by tightly coupling speech recognition and machine translation (Ney, 1999) or by keeping the cascaded structure unchanged and using an integration model, a log-linear model, to re-score the translation hypotheses (Zang et al., 2004).
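The log-linear re-scoring idea described above can be illustrated with a minimal sketch. It is not the integration model of the cited work: the feature names (`acoustic`, `asr_lm`, `translation`), the weights and the log-domain feature values are all hypothetical and chosen only to show how a weighted sum of recognition and translation features can reorder an N-best list.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    features: dict  # feature name -> log-domain score (hypothetical values)


def loglinear_score(features, weights):
    """Weighted sum of feature scores: sum_i lambda_i * h_i."""
    return sum(weights[name] * value for name, value in features.items())


def rescore(nbest, weights):
    """Return the N-best list sorted from best to worst log-linear score."""
    return sorted(nbest,
                  key=lambda h: loglinear_score(h.features, weights),
                  reverse=True)


# Hypothetical weights; in practice they would be tuned on held-out data.
WEIGHTS = {"acoustic": 0.3, "asr_lm": 0.2, "translation": 0.5}

# Hypothetical two-entry N-best list of translation hypotheses.
NBEST = [
    Hypothesis("translation A",
               {"acoustic": -12.0, "asr_lm": -8.0, "translation": -6.0}),
    Hypothesis("translation B",
               {"acoustic": -14.0, "asr_lm": -7.0, "translation": -3.0}),
]

if __name__ == "__main__":
    for hyp in rescore(NBEST, WEIGHTS):
        print(hyp.text, round(loglinear_score(hyp.features, WEIGHTS), 2))
```

With these illustrative numbers, hypothesis B ranks first even though its acoustic score is worse, because the translation feature carries the largest weight.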

Speech Translation
The goal of speech translation research is to enable real-time, interpersonal communication in natural spoken language between people who do not share a common language. Speech Translation (ST) is the process by which spoken expressions are rapidly translated and spoken aloud in a second language. This contrasts with phrase translation, where the system merely translates a predetermined and finite set of phrases.

The METEOR Measure
The METEOR score is computed by aligning the system output to the closest reference translation, as in Figure 5. After stemming, "cries" and "crying" are considered a match, as are "saying" and "says". In Figure 5, three words of the reference translation (in boldface) are not matched to the system output, and three words of the system output (not boldface) do not match the reference translation (Condon et al., 2009).
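The stem-based matching described above can be reproduced with a small sketch. It is a simplification under stated assumptions, not the METEOR implementation: the stem table `TOY_STEMS` is a hand-written stand-in for a real stemmer, the matching is greedy from left to right, and the fragmentation penalty and synonym matching of the full metric are omitted. The two sentences are the system output and Ref 3 from the example.

```python
# Hand-written stem table: an assumption standing in for a real stemmer.
TOY_STEMS = {"cries": "cry", "crying": "cry", "saying": "say", "says": "say"}

SYSTEM = "he has stomach pain and always crying he says pain in stomach"
REF3 = "he has some stomach pain and always cries saying my stomach hurts"


def stem(word):
    """Look the word up in the toy table; unknown words are their own stem."""
    return TOY_STEMS.get(word, word)


def align(system, reference):
    """Greedy left-to-right unigram matching on stems.

    Returns (pairs, unmatched_system, unmatched_reference).
    """
    sys_words = system.split()
    ref_words = reference.split()
    used = [False] * len(ref_words)
    pairs, unmatched_sys = [], []
    for s in sys_words:
        for j, r in enumerate(ref_words):
            if not used[j] and stem(s) == stem(r):
                used[j] = True
                pairs.append((s, r))
                break
        else:  # no unused reference word shares this stem
            unmatched_sys.append(s)
    unmatched_ref = [r for j, r in enumerate(ref_words) if not used[j]]
    return pairs, unmatched_sys, unmatched_ref


if __name__ == "__main__":
    pairs, extra_sys, extra_ref = align(SYSTEM, REF3)
    print(len(pairs), "matched unigrams")
    print("unmatched system words:", extra_sys)
    print("unmatched reference words:", extra_ref)
    # Unigram precision and recall, the quantities METEOR combines.
    print("precision:", len(pairs) / len(SYSTEM.split()))
    print("recall:", len(pairs) / len(REF3.split()))
```

On this example the sketch finds nine matched unigrams and leaves three words unmatched on each side, consistent with the description of the alignment in the text.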

The TER, STER and HTER Measures
The TRANSTAC program has also experimented with the TER metric to measure translation quality. Unlike METEOR, TER allows any number of contiguous words to shift positions in a single move. Computation of the TER score is based on the Levenshtein edit distance measure for string matching (Cohen et al., 2003), which counts the number of insertions, deletions, and substitutions required to transform one string into another. Figure 6 shows how the alignment in Figure 5 would be edited to transform the system output into the reference translation. The deletions and substitutions that transform "he says pain in" into "saying" could have been aligned differently with no effect on the number of deletions and substitutions (Condon et al., 2009).
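The Levenshtein computation underlying TER can be sketched at the word level as follows. This is a minimal illustration under stated assumptions: it counts only insertions, deletions and substitutions, omits the block shifts that full TER also allows, and normalizes by the length of a single reference. The function names are illustrative, and the sentences are the system output and Ref 3 from the example.

```python
SYSTEM = "he has stomach pain and always crying he says pain in stomach"
REF3 = "he has some stomach pain and always cries saying my stomach hurts"


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between hypothesis and reference."""
    h, r = hyp.split(), ref.split()
    # prev[j] = distance between the hypothesis prefix so far and r[:j]
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, start=1):
        curr = [i]  # turning i hypothesis words into nothing costs i deletions
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # delete hw
                            curr[j - 1] + 1,      # insert rw
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def shift_free_edit_rate(hyp, ref):
    """Edits divided by reference length (TER without the shift operation)."""
    return word_edit_distance(hyp, ref) / len(ref.split())


if __name__ == "__main__":
    print(word_edit_distance(SYSTEM, REF3))              # 7
    print(round(shift_free_edit_rate(SYSTEM, REF3), 3))  # 0.583
```

Because shifts are not modeled, this edit rate is an upper bound on what a shift-aware TER computation would report for the same sentence pair.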

IBM's MASTOR
IBM MASTOR, short for Multilingual Automatic Speech-to-Speech Translator, was developed for the DARPA CAST program, whose mission is to develop technologies that facilitate rapid deployment of real-time speech-to-speech translation of low-resource languages on mobile devices (Gao et al., 2006). The general structure of the MASTOR system comprises ASR, MT and TTS components. This pipeline approach allows existing speech and language processing techniques to be deployed while addressing the problems unique to speech-to-speech translation (Dureja and Gautam, 2015). Grapheme-based acoustic models are used to overcome the absence of short vowels. They lead to unambiguous pronunciations of lexicon entries and hence facilitate model training and decoding. However, the same grapheme may yield different phonetic sounds depending on its context, which leads to less accurate acoustic models. For this reason, two alternative approaches were introduced. The first, known as the full phonetic approach, uses short vowels; the second uses context-sensitive graphemes, in which two different phonemes are generated for the letter "A" (Alif) depending on its position in the word. The IBM ViaVoice product engine is a robust and efficient framework used for acoustic modelling; it uses rank-based acoustic scores derived from tree-clustered context-dependent Gaussian models for both desktop and hand-held systems (Narayanan et al., 2006).

Verbmobil
Verbmobil is a speaker-independent, two-way speech-to-speech translation system for spontaneous dialogs in mobile situations. It first recognizes the input, then analyses and translates it, and finally delivers the translation. It is a multilingual system that handles dialogs in three business-oriented domains, with context-sensitive translation between three languages (German, English and Japanese) (Wahlster, 2013). The system deals with spontaneous dialogs. This means not just continuous speech, as in current dictation systems, but speech that includes disfluencies and repair phenomena such as changes in mid-word, hesitations ("um", "er"), and short words accidentally left out in rapid speech. For example, in the Verbmobil corpus 20% of all dialog turns contain at least one self-correction and 3% include false starts. The system combines deep and shallow analysis methods to detect slips in the speech and then translates what the speaker intended to say rather than what was actually said (Dureja and Gautam, 2015).

Literature Review
Prior work on S2S translation has primarily focused on providing either one-way or two-way translation on a single device (Waibel et al., 2003; Zhou et al., 2003). Typically, the user interface requires the participant(s) to choose the source and target language a priori. The nature of the communication, either a single user talking or turn taking between two users, can result in a one-way or cross-lingual dialog interaction. In most systems, the need to choose the direction of translation for each turn detracts from a natural dialog flow. Furthermore, single-interface S2S translation (embedded or cloud based) is not suitable for cross-lingual communication when participants are geographically distant, a scenario more likely in a global setting. In such a scenario, it is imperative to provide real-time, low-latency communication (Bangalore et al., 2012).
Researchers have recognized that translation quality is multi-faceted and that human judgments of even more specific qualities such as fluency and fidelity are not always reliable (King, 1996; Turian, Shen & Melamed, 2003). Given the unevenness and cost of human judgments, researchers have welcomed automated measures such as BLEU and have proposed a plethora of alternative methods, all of which involve comparisons to one or more reference translations (Condon et al., 2008). In contrast, evaluations of speech translation have relied on human judgments such as the binary or ternary classifications adopted by CMU (Gates et al., 1996) and Verbmobil (Nübel, 1997) researchers, which combine assessments of accuracy and fluency. Other methods use abstract semantic representations of the source utterances and require human judges to score structural elements of those representations separately. CMU researchers use the Interlingua Interchange Format to represent utterance intent and content (Levin et al., 2000).
Sageetha and Jothilakshmi (2015) conducted a study titled "Integrating Machine Translation and Speech Synthesis Component for English to Dravidian Language Speech to Speech Translation System". The paper provides an interface between the machine translation and speech synthesis systems for converting English speech to Tamil text in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed, but the speech synthesis component has not yet been considered. The authors focus on the integration of machine translation and speech synthesis, and report a subjective evaluation that investigates the impact of speech synthesis, machine translation and the integration of the two components. They implement hybrid machine translation (a combination of rule-based and statistical machine translation) and a concatenative syllable-based speech synthesis technique. To retain the naturalness and intelligibility of the synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used. The results demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.
Sanders et al. (2013) conducted a study titled "Evaluation methodology and metrics employed to assess the TRANSTAC two-way, speech-to-speech translation systems". One of the most difficult challenges that military personnel face when operating in foreign countries is clear and successful communication with the local population. To address this issue, the Defense Advanced Research Projects Agency (DARPA) is funding academic institutions and industrial organizations through the Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program to develop practical machine translation systems. The goal of the TRANSTAC program is to demonstrate capabilities to rapidly develop and field free-form, two-way, speech-to-speech translation systems that enable speakers of different languages to communicate with one another in real-world tactical situations without an interpreter. Evaluations of these technologies are a significant part of the program, and DARPA has asked the National Institute of Standards and Technology (NIST) to lead this effort. The article presents the experimental design of the TRANSTAC evaluations and the metrics, both quantitative and qualitative, that were used to comprehensively assess the systems' performance.
Brian et al. (2011) conducted a study titled "Performance Assessments of Two-Way, Free-Form, Speech-to-Speech Translation Systems for Tactical Use". A critical challenge for military personnel operating in foreign countries is effective communication with the local population. To address this issue, DARPA created the TRANSTAC program. The program's goal is to develop speech-to-speech translation technologies that enable English speakers to communicate quickly with the local population without an interpreter. DARPA has funded the National Institute of Standards and Technology to lead the design and implementation of the TRANSTAC performance evaluations. The article presents these evaluations, which enabled the collection of rich quantitative and qualitative metrics.
He et al. (2011) conducted a study titled "Why Word Error Rate Is Not a Good Metric for Speech Recognizer Training for the Speech Translation Task?" Speech translation (ST) is an enabling technology for cross-lingual oral communication. An ST system consists of two major components: an automatic speech recognizer (ASR) and a machine translator (MT). Most ASR systems are trained and tuned by minimizing word error rate (WER). However, WER counts word errors at the surface level. It does not consider the contextual and syntactic roles of a word, which are often critical for MT. In end-to-end ST scenarios, whether WER is a good metric for the ASR component of the full ST system is an open issue that lacks systematic study. The authors report an investigation of this issue, focusing on the interactions of ASR and MT in an ST system. They show that BLEU-oriented global optimization of ASR system parameters improves translation quality by an absolute 1.5% BLEU score, while sacrificing WER relative to the conventional, WER-optimized ASR system. They also conducted an in-depth study of the impact of ASR errors on the final ST output. Their findings suggest that the speech recognizer component of the full ST system should be optimized by translation metrics instead of the traditional WER.
Bangalore et al. (2012) conducted a study titled "Real-time Incremental Speech-to-Speech Translation of Dialogs". They addressed the problem of incremental speech-to-speech translation (S2S) that enables cross-lingual communication between two remote participants over a telephone. They investigated the problem in a novel real-time Session Initiation Protocol (SIP) based S2S framework. The speech translation is performed incrementally, based on the generation of partial hypotheses from speech recognition. They describe the statistical models comprising the S2S system and the SIP architecture for enabling real-time two-way cross-lingual dialog. They present dialog experiments performed in this framework and study the tradeoff between accuracy and latency in incremental speech translation. Experimental results demonstrate that high-quality translations can be generated with the incremental approach with approximately half the latency associated with the non-incremental approach.
Hamon and Mostefa (2008) conducted a study titled "An Experimental Methodology for an End-to-End Evaluation in Speech-to-Speech Translation". The paper describes the evaluation methodology used to evaluate the TC-STAR speech-to-speech translation (SST) system and the results from the third year of the project. It follows the results presented in (Hamon et al., 2007), which dealt with the first end-to-end evaluation of the project. The authors experiment with the methodology and the protocol during the second end-to-end evaluation by comparing outputs from the TC-STAR system with interpreters from the European Parliament. For this purpose, they test different evaluation criteria and types of questions within a comprehension test. The results reveal that interpreters do not translate all the information (as opposed to the automatic system), but the quality of SST is still far from that of human translation. The experimental comprehension test provides new information for studying the quality of automatic systems, but without settling the issue of which protocol is best. This depends on what the evaluator wants to know about the SST: either a subjective end-user evaluation or a more objective one.
Gao et al. (2006) conducted a study titled "IBM MASTOR SYSTEM: Multilingual Automatic Speech-to-speech Translator". They describe IBM MASTOR, a speech-to-speech translation system that can translate spontaneous free-form speech in real time on both laptops and hand-held PDAs. Challenges include speech recognition and machine translation in adverse environments, the lack of training data and linguistic resources for under-studied languages, and the need to rapidly develop capabilities for new languages. Another challenge is designing algorithms and building models in a scalable manner so that they perform well even on memory- and CPU-deficient hand-held computers. They describe their approaches, experience, and success in building working free-form S2S systems that can handle two language pairs (including a low-resource language).
Narayanan et al. (2006) conducted a study titled "Speech Recognition Engineering Issues in Speech to Speech Translation System Design for Low Resource Languages and Domains". Engineering automatic speech recognition (ASR) for speech-to-speech (S2S) translation systems, especially for languages and domains that do not have readily available spoken language resources, is immensely challenging for a number of reasons. In addition to contending with the conventional data-hungry acoustic and language modeling needs, these designs have to accommodate varying requirements imposed by the domain needs and characteristics, the target device and usage modality (such as phrase-based or spontaneous free-form interactions, with or without visual feedback) and the huge spoken language variability arising from socio-linguistic and cultural differences among users. Using case studies of creating speech translation systems between English and languages such as Pashto and Farsi, the paper describes some of the practical issues and the solutions that were developed for multilingual ASR development. These include novel acoustic and language modeling strategies such as language-adaptive recognition, active-learning-based language modeling, class-based language models that can better exploit resource-poor language data, efficient search strategies, including N-best and confidence generation to aid multiple-hypothesis translation, use of dialog information and clever interface choices to facilitate ASR, and audio interface design for meeting both usability and robustness requirements.
Godden (2002) conducted a study titled "Towards a Speech-to-Speech Machine Translation Quality Metric". General characteristics of a pragmatic metric for the production evaluation of speech-to-speech translations are discussed. While these characteristics constrain the space of allowable metrics, an infinite definition space remains from which to select and define any particular metric. The recommended characteristics are drawn from the author's experience as primary developer of a text-based translation quality metric used in a production environment. The primary contribution is that of strict category ordering and two meta-rules that reduce the variance in the assignment of errors to categories.

Conclusion
In this paper we investigated the methodology and metrics employed to assess speech-to-speech translation systems. We briefly discussed speech translation and then introduced the speech translation system and its components. We described metrics for speech-to-speech translation systems, including the BLEU measure, the METEOR measure, and the TER, STER and HTER measures. We explored the methodology used for automatic speech recognition, including IBM's MASTOR and Verbmobil. In the experiments we presented, some methods were applied to translating automatic speech recognition output for English utterances. According to Godden (2002), a U2U (utterance-to-utterance) metric does not automatically become a good metric. The category definitions are of extreme importance, as are the examples used to illustrate the definitions and the training materials created for evaluators. Without clear, unambiguous and precise error definitions, no metric will be of any practical value. Hamon and Mostefa (2008) found that interpreters do not translate all the information (as opposed to the automatic system), but the quality of SST is still far from that of human translation. Bangalore et al. (2012) demonstrated that high-quality translations can be generated with the incremental approach with approximately half the latency associated with the non-incremental approach. He et al. (2011) concluded that BLEU-oriented global optimization of ASR system parameters improves translation quality by an absolute 1.5% BLEU score, while sacrificing WER (word error rate) relative to the conventional, WER-optimized ASR system.

Figure 4. Sample Reference Translations and System Output
Ref 1: he has some pain in his stomach and always cries and complains about stomach pain
Ref 2: he has some pain in his stomach and he always cries and says I have a stomach pain
Ref 3: he has some stomach pain and always cries saying my stomach hurts
Ref 4: he has a stomach ache and he always cries and says my stomach hurts
System: he has stomach pain and always crying he says pain in stomach

Figure 5. METEOR Alignment of System Output and Reference Translation
Ref 3: he has some stomach pain and always cries saying my stomach hurts
System: he has stomach pain and always crying he says pain in stomach

Figure 6. TER Alignment of System Output with Reference Translation and Edits
Ref 3: he has some stomach pain and always cries saying my stomach hurts
System: he has stomach pain and always crying he says pain in stomach
Edits: insertion substitution deletion substitution deletion substitution deletion