Augmenting Performance of SMT Models by Deploying Fine Tokenization of the Text and Part-of-Speech Tag

This paper presents our study of exploiting word class information, augmented with some rule-based processing, for phrase-based Statistical Machine Translation (SMT). In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Oromo, this problem can be particularly severe. In addition, different symbols are used for ‘hudhaa’ (the diacritical marker) in Oromo text, which introduces another severe data sparsity problem. In this work, we modify the Oromo input using fine tokenization that considers the intra-word behavior of words containing hudhaa, together with POS tags, and show how this improves an Oromo-English machine translation system. The models were trained on a very small parallel corpus (usually unacceptable for a normal SMT system), and the quality of the parallel corpus, in terms of both translation and spelling errors, was not good. Yet our final system achieves a BLEU score of 2.88, compared to 2.56 for the baseline system.


Introduction
Several phrase-based Statistical Machine Translation (SMT) models, such as those of Och and Ney (1999, 2004), Marcu and Wang (2002), and Koehn et al. (2003), have achieved promising performance. These phrase-based models have shown a number of advantages over the pioneering models of Brown et al. (1993), which were based on word-level expression recognition and local restructuring. The advantages of phrase-based SMT result from moving from words to phrases as the basic unit of translation.
There is no question that these phrase-based models are much better statistical translation methods. However, although these systems have been successful, they have some potential drawbacks when it comes to modeling word-order differences between languages, because phrase-based systems make little or only indirect use of the languages' syntactic information. That is why they are still called "non-linguistic", as Nguyen and Shimazu (2006) put it. That is, in phrase-based systems tokens are treated simply as words, and a phrase can be any sequence of tokens, which need not be a phrase in any linguistic sense. Moreover, the reordering models of these systems are based only on movement distance, not on phrase content, as Och and Ney (2004) and Koehn et al. (2003) argue. Another drawback of phrase-based SMT that ignores linguistic information is the sparse data problem, since obtaining huge parallel corpora is difficult and expensive for unexplored language pairs like Oromo-English.
In literal phrase-based SMT, the sparse data problem prevails because the inflected forms of the same word are often treated as different words. This problem is more serious and apparent when one or both of the source and target languages are morphologically rich and inflectional. In our case, Oromo is an inflectional, scarce-resource language that needs special treatment to obtain a better phrase-based SMT system. In addition, the use of different symbols for 'hudhaa' (the diacritical marker) in Oromo introduces another severe data sparsity problem. In this paper, we describe our approach to improving statistical machine translation by augmenting linguistic annotation of the language with deeper in-word tokenization of Oromo, in order to address the aforementioned data sparsity problems of phrase-based SMT.
The rest of this paper is structured as follows. Section II reviews related works and frameworks that can support MT with such scarce resources, to emphasize the importance and motivation of our research, and briefly reviews why such linguistic processing matters in text processing. Section III presents the features of the Oromo language, intended to further emphasize the research problem. Section IV discusses the experimental setups used for the research. Section V presents the experimental results, their discussion, and evaluation. Finally, Section VI concludes our paper and suggests avenues for future work.

Related Works
Research in MT involving the Oromo language is not common; to our knowledge, no such system has been attempted to date. The main reason is perhaps the lack of parallel corpora. Nevertheless, there have been many models for other languages, especially for translation between English and other languages. As far as we know, there has never been an effort to build a parallel corpus of Oromo and any other language. Therefore, one of our goals in this work is to perform experiments with a corpus that we collected ourselves from various web sources.
There have been attempts at using different approaches in the development of Statistical Machine Translation. Goldwater et al. (2005) and Durgar et al. (2006) presented some very preliminary results on problems in developing a statistical machine translation system, mainly dealing with different morphological concepts. Applying such linguistic context, like morphological processing, to SMT is not new, and has been done at different levels. The idea goes back to Lee (2004) and Isbihani et al. (2006) for Arabic-English translation; Nießen and Ney (2004) for German-English translation; and Ramanathan et al. (2008, 2009) for their first attempt at English-Hindi translation. Nguyen and Shimazu (2006) also analyzed the improvement of phrase-based SMT with morpho-syntactic analysis and transformation for English-Vietnamese translation.
In SMT, correspondences between the words in the source and target languages are learned from parallel corpora, and often little or no linguistic knowledge is incorporated to structure the underlying models (Lopez, 2008). In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. To improve translation quality in a framework with scarce resources, Nießen et al. (2004) proposed the construction of hierarchical lexical models on the basis of equivalence classes of words. Other publications, such as Al-Onaizan et al. (2000), have also dealt with the problems of translation for less-resourced language pairs. That report describes an experiment involving Tetun-to-English translation by different groups, including one using SMT.

Oromo Language Features
The most important phenomenon in the Oromo language is its morphology. Oromo is a highly inflected language. Even though both English and Oromo are inflected languages, Oromo is the more inflectional one. Most Oromo inflected word forms can be translated into an English phrase.
In addition to these morphological, inflectional and word order problems, the Oromo language has another difficulty that creates data sparsity: the variation in the symbols used to represent hudhaa in Oromo texts. Hudhaa is the diacritical marker, or glottal symbol, in Oromo. Nedjo et al. (2014a) presented the problem that hudhaa poses for tokenization, that is, for finding tokens in a text, in natural language processing of Oromo. They emphasized that hudhaa introduces a data sparsity problem, because some people use the apostrophe ('), the right-quote character (’), the left-quote character (‘), the grave character (`), the acute accent, or other symbols to indicate the glottal in the middle of words or as a component of words. Among the many difficulties mentioned above, this paper investigates the data sparsity problem of Oromo in the development of MT, using the Oromo tokenizer and the Oromo POS tagger of Nedjo et al. (2014a, 2014b) to train the model.
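The effect of the hudhaa variants on sparsity can be illustrated with a minimal sketch. This is not the tokenizer of Nedjo et al.; it only assumes that the variant symbols listed above are mapped to one canonical apostrophe and that a hudhaa between letters is kept inside the word. The names `normalize_hudhaa` and `tokenize`, the variant list, and the sample strings are illustrative assumptions.

```python
import re

# Symbols assumed to stand in for hudhaa, following the variants named in the text:
# right quote, left quote, grave character, acute accent.
HUDHAA_VARIANTS = "\u2019\u2018\u0060\u00b4"
CANONICAL_HUDHAA = "'"

_VARIANT_TABLE = str.maketrans({ch: CANONICAL_HUDHAA for ch in HUDHAA_VARIANTS})

# A word is a run of letters, optionally joined internally by hudhaa;
# numbers and single punctuation marks become separate tokens.
_TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*|\d+|[^\w\s]")


def normalize_hudhaa(text: str) -> str:
    """Map every assumed hudhaa variant to the plain apostrophe."""
    return text.translate(_VARIANT_TABLE)


def tokenize(text: str) -> list[str]:
    """Normalize hudhaa, then split into tokens without breaking words at hudhaa."""
    return _TOKEN_RE.findall(normalize_hudhaa(text))


# Two spellings of the same illustrative word collapse to one token type.
print(tokenize("bu\u2019aa bu`aa."))
```

Without such normalization, each spelling of the same word would be counted as a separate vocabulary entry. A limitation of this sketch is that genuine quotation marks are also mapped to the apostrophe; they are split off as separate tokens only when they occur at a word boundary.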

Architecture of the System
Deploying a clear architecture plays a central role in building a statistical machine translation system, as is true for any system development. As depicted in Figure 1, the architecture shows the interfaces and coordination of all the translation system components needed for the development of SMT: the Language Model (LM), the Translation Model (TM), Pre-processing, and the Decoder (Modelling). The Language Model is prepared from the target language, and the Decoder gives the probability of the target sentence given the source sentence. The Language Model (LM) gives the probability of a sentence, which depends on the probabilities of the individual words. This language model is used to ensure the fluency of the output, so it is built from the target language (i.e. English in this case). If 'T' is a target language sentence, the LM computes P(T) and feeds it into the decoder software. IRSTLM (Federico et al., 2008) was used for language modeling. IRSTLM is an open-source language modeling toolkit hosted on SourceForge. IRSTLM was used to build a 3-gram language model, removing singletons, smoothing with improved Kneser-Ney, and adding sentence boundary symbols.
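How a 3-gram model with sentence boundary symbols assigns P(T) can be sketched in a few lines. This is a toy illustration, not IRSTLM: add-one smoothing is swapped in for improved Kneser-Ney to keep the example short, singleton pruning is omitted, and the function names and the two-sentence corpus are assumptions made for the example.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary symbols


def pad(tokens):
    """Add two start symbols and one end symbol around a tokenized sentence."""
    return [BOS, BOS] + list(tokens) + [EOS]


def train(sentences):
    """Collect trigram counts, their history counts, and the vocabulary."""
    trigrams, histories, vocab = Counter(), Counter(), {EOS}
    for sentence in sentences:
        vocab.update(sentence)
        padded = pad(sentence)
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigrams[history + (padded[i],)] += 1
            histories[history] += 1
    return trigrams, histories, vocab


def sentence_logprob(sentence, trigrams, histories, vocab):
    """Return log P(T) as a sum of add-one-smoothed trigram log-probabilities."""
    padded = pad(sentence)
    logp = 0.0
    for i in range(2, len(padded)):
        history = (padded[i - 2], padded[i - 1])
        count = trigrams[history + (padded[i],)]
        logp += math.log((count + 1) / (histories[history] + len(vocab)))
    return logp


corpus = [["the", "lord", "is", "good"], ["the", "lord", "is", "great"]]
model = train(corpus)
fluent = sentence_logprob(["the", "lord", "is", "good"], *model)
scrambled = sentence_logprob(["good", "is", "lord", "the"], *model)
print(fluent > scrambled)  # the word order seen in training scores higher
```

The comparison at the end shows the role the LM plays in the system: among candidate outputs with the same words, the one whose word order resembles the target language training data receives the higher probability.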

Translation Model Module
The Translation Model (TM) computes the probability of the source sentence 'S' for a given target sentence 'T'. Mathematically, the probability computed by the TM is P(S|T), and the output of the TM is fed into the Moses decoder. The translation model for this system was developed using the open source software GIZA++ of Och and Ney (2003). GIZA++ is free and open source software hosted at Google Code (Note 2), and a mirror of the original documentation is also available (Note 3).

The Preprocessing Module
The preprocessing module provides preprocessing for Oromo-English SMT. Oromo is a morphologically rich language, which poses some problems for statistical machine translation approaches. Much research in statistical machine translation has shown the importance of POS tagging and morphological preprocessing (i.e. fine word-level tokenization) for translation quality. The common wisdom in the field is that such preprocessing helps, especially for morphologically rich languages such as Oromo, because it reduces model sparsity and increases source-target symmetry (particularly when the target is morphologically poor, as English is). For the target language, English, the preprocessing scripts accompanying the Moses decoder software were used directly. For the source language, Oromo, the Oromo tokenizer and the Oromo POS tagger presented in Nedjo et al. (2014a) and Nedjo et al. (2014b), respectively, were used together with some scripts from Moses to implement the preprocessing of this Oromo-English translation system.
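The text does not spell out how the POS tags are attached to the Oromo input. One common option in Moses is the factored representation, in which each token carries its annotations separated by a vertical bar. The sketch below assumes that format; the function name `to_factored` and the tag labels are hypothetical and are not the tagset of the Oromo POS tagger.

```python
def to_factored(tokens, tags, sep="|"):
    """Join each token with its POS tag in the word|TAG style (assumed format)."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    for token in tokens:
        if sep in token:
            # The separator is reserved, so it must not occur inside a token.
            raise ValueError(f"token contains the factor separator: {token!r}")
    return " ".join(f"{token}{sep}{tag}" for token, tag in zip(tokens, tags))


# Hypothetical tags, chosen only for illustration.
print(to_factored(["Yesus", "Gooftaa", "dha"], ["NOUN", "NOUN", "COP"]))
```

Attaching the tag to the surface form lets the translation model distinguish uses of the same word form in different word classes, at the cost of a larger source vocabulary.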

The Decoder or Modelling Module
The decoder maximizes the probability of the generated sentence, making use of the argmax() function to do so. Moses, which is freely available under an open source license, was used as the decoder. Moses is compatible with the IRSTLM and GIZA++ tools; it accepts source language text as input and generates target language text. The job of the decoder is therefore to find the highest scoring sentence in the target language, English (according to the translation model), corresponding to a given sentence in the source language, Oromo. The probability files are taken from the LM and the TM. The Moses decoder can be set in interactive mode for translation (Koehn 2007). The software is free and can be obtained online (Note 1). It is being developed as a reference implementation of state-of-the-art methods in statistical machine translation, as stated in the Statistical Machine Translation textbook of Koehn (2010) and the Moses manual.
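With P(T) supplied by the language model and P(S|T) by the translation model, the search performed by the decoder is commonly written in the noisy-channel form below. This is the standard textbook formulation, given here as a sketch; Moses itself combines these and further feature scores in a weighted log-linear model whose weights are set during tuning.

```latex
\hat{T} \;=\; \arg\max_{T} P(T \mid S)
        \;=\; \arg\max_{T} \frac{P(T)\, P(S \mid T)}{P(S)}
        \;=\; \arg\max_{T} P(T)\, P(S \mid T)
```

The denominator P(S) can be dropped because the source sentence S is fixed during the search, so it does not affect which T attains the maximum.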

Data Sets
We conducted the experiments on a parallel corpus of Oromo-English training, development, and test data sets. The statistics for these sets are shown in Table 1. Almost all of the data were taken from the Holy Bible and were used regardless of the quality of the translation, since the available data are both low in quality and small in quantity.

Training the Translator
To ensure the fluency of the output of the statistical machine translation system, a language model was trained on the target language, English, and used with the decoder. The language modeling software IRSTLM, which had already been installed, was used to train on the target language data set.
To train the translation model, which is the main step in developing a statistical machine translation system, we ran word alignment (using GIZA++), phrase extraction and scoring, created the lexicalized reordering tables, and created the Moses configuration file. First an appropriate directory was created; then the training command was executed.

Tuning for Quality and Speed
The parameters of the model were tuned on a small amount of parallel data, separate from the training data set, that was kept aside for development. Moses uses the weights given in the moses.ini file to translate texts. Default values for these weights are generated by the system during training and written to the Moses configuration file, moses.ini. By tuning, we set the model parameters in the moses.ini file to improve the quality of the translation.
After a considerable amount of time, the end result of tuning was a moses.ini file with trained weights. The model was used to translate Oromo sentences into English by running the Moses decoder executable. However, the translator was too slow to start up and run, and it consumed a lot of memory. Therefore, a binarized phrase table and binarized lexicalized reordering models were used in the prototype machine translation system of this research.

Testing the Prototype
With the binarized phrase table and lexicalized reordering table in place, we used another parallel data set (the test set), distinct from the ones used so far, to evaluate the quality of the translated text and so to learn how good the translation system is. The test set was similar in nature to the development set used for tuning, and was taken from the same sources. As was done for the training and development sets, the test set was tokenized and truecased. The trained model was then filtered for this test set (in order to retain only the entries needed to translate the test set) so that translation would be much faster. Automatic testing of the translator was done by first translating the test set and then running the BLEU script on the output. Test results were obtained from both models, and the resulting BLEU scores were used to compare the performance improvement.
As mentioned earlier, this research was conducted on the same corpus under two different experimental settings.
The same data sets (the same training, development, and test sets) were used for the two experiments so that the results could be compared. The first experiment developed a baseline translation system trained to translate Oromo sentences into English sentences. The second experiment trained a translator using the Oromo tokenizer, which applies intra-word preprocessing during tokenization to handle the problem of the diacritical marker hudhaa (discussed in Nedjo et al. (2014a)), and the Oromo POS tagger (discussed in Nedjo et al. (2014b)).
The results of the experiments are discussed in the next section.

Results of the Experiment: Discussions and Evaluation
The proposed Oromo-English statistical machine translation system accepts Oromo sentences as input and gives the corresponding English sentences as output. For example, the Oromo sentence "Yesus Gooftaa dha" was supplied from the standard input. The model accepted this input in interactive mode and produced an acceptable English translation. One such translation output is shown in Figure 2.
In machine translation, there are two ways of evaluating the performance of a model. One is automatic evaluation by a computer program, and the other is manual evaluation by human translators.
The evaluation reported in this paper used both mechanisms for different purposes, focusing on evaluating the output of the system, and on performance or usability in particular.
Automatic evaluation was used to quantitatively measure and clearly compare the results of the two experiments.
The outputs of the baseline system and of the new system, which is designed to complement the performance of SMT for less-resourced language pairs by deploying fine word-level tokenization and POS tagging of the text, were measured by BLEU. The BLEU evaluation script is included in the Moses decoder package, and running it on the same test data for both models gave the results shown in Table 2: the new model achieved a relative improvement of 12.5% over the baseline. Manual evaluation was also used in these experiments, to judge the extent to which the prototype would be acceptable to users and to indicate how promising the result is, so that a full-fledged Oromo-English MT system can be optimistically foreseen on the basis of this research. By executing the translator, 100 Oromo sentences were translated into English. Only ten sentence pairs (10%), randomly selected, are shown in Appendix A. The table there shows the Oromo sentences along with the corresponding translation output of the Oromo-English SMT system, the human reference translations, and the average evaluation results of five translators on the adequacy and fluency parameters. Some Out Of Vocabulary (OOV) words appear in the output generated by the system and are tagged as UNK. It is on the basis of these translation outputs that the human evaluators judged the performance of the system.
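The 12.5% figure reported for the automatic evaluation is a relative gain over the baseline BLEU score. Using the two scores reported in this paper (2.56 for the baseline and 2.88 for the improved system), the arithmetic is as follows; the subscripts are labels introduced here for readability.

```latex
\frac{\mathrm{BLEU}_{\mathrm{new}} - \mathrm{BLEU}_{\mathrm{base}}}{\mathrm{BLEU}_{\mathrm{base}}}
  \;=\; \frac{2.88 - 2.56}{2.56}
  \;=\; \frac{0.32}{2.56}
  \;=\; 0.125 \;=\; 12.5\%
```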
In this manual evaluation method, the translation was evaluated on the parameters of adequacy and fluency.
Adequacy is defined as the degree to which the meaning of the reference sentence is conveyed in the translation, whereas fluency refers to the grammatical accuracy of the translated text. The translators used both parameters because each has its own scale of levels on which a given translation can be rated. For this translation output, the levels on which adequacy and fluency were evaluated are given in Table 3 and Table 4, respectively; for example, a score of 1 is given when the translated sentence is understood wrongly. These parameters were evaluated by five persons who are native Oromo speakers and are fluent in both languages. The geometric mean of the scores was taken, and the result is shown in Table 5. The geometric mean was preferred in order to normalize the ranges being averaged, so that no range dominates the weighting. The names of the translators are not disclosed; the alias 'Evaluator #' is used for each respondent.
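The aggregation step can be sketched as follows. The scores below are hypothetical, not the values behind Table 5, and the 1 to 5 range is an assumption based on the 5-point maximum mentioned in the conclusions; `aggregate_scores` is a name introduced for this example. The sketch relies on `statistics.geometric_mean` from the Python standard library (Python 3.8 or later).

```python
from statistics import geometric_mean, mean


def aggregate_scores(scores, low=1, high=5):
    """Return the geometric mean of evaluator scores on an assumed low..high scale."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(not low <= score <= high for score in scores):
        raise ValueError(f"scores must lie between {low} and {high}")
    return geometric_mean(scores)


# Hypothetical adequacy scores from five evaluators for one sentence.
adequacy = [3, 4, 3, 4, 4]
print(round(aggregate_scores(adequacy), 2), round(mean(adequacy), 2))
```

The geometric mean is never larger than the arithmetic mean of the same scores, so a single high rating pulls the aggregate up less than it would under simple averaging.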

Conclusions and Final Remarks
With the experimental setups discussed so far, the Oromo-English machine translation system produced the required translation output, even under such scarce-resource constraints. With the automatic evaluation method, a BLEU score difference of 0.32, which is a relative improvement of about 12.5%, was achieved. The model was trained on a very small parallel corpus (usually unacceptable for a normal SMT system), and the quality of the parallel corpus, in terms of both translation and spelling errors, was poor. Overall, the accuracy of the model is very low when compared with SMT models of other languages trained on far larger data sets. Yet this measurement was enough to evaluate and contrast the performance of the two models, and it showed that complementing the performance of SMT by deploying fine tokenization and POS tags is achievable even for such a very small parallel corpus (only about 15,000 sentences). Based on the feedback obtained from the human evaluators, this Oromo-English machine translation prototype was found to be a promising start, and a full-fledged MT system may optimistically be developed from this model.
The translation of 100 sentences was evaluated by human evaluators on the parameters of adequacy and fluency. Geometric averages of 3.45 for adequacy and 3.48 for fluency were achieved, out of a maximum of 5 points. This human judgment also suggests that the system performed well. It should be recalled that the quality of the translated text depends highly on the size and quality of the corpus used to train and develop the system, which this research indeed lacks.
In the future, a better machine translation system is probably achievable by combining this idea with a morphological analyzer.

Figure 1. Detail Architecture of Oromo-English SMT System

Figure 2. Result of sample Oromo sentence translation

Table 1. Data sets

Table 2. The baseline and the improved system translation outputs in BLEU

Table 3. Level of adequacy of the translator points

Table 4. Level of fluency of the translator points

Table 5. The prototype evaluation results by human translators