Arabic-to-Malay Machine Translation Using Transfer Approach

Translation from and to Arabic has been widely studied in recent years. This study focuses on translating Arabic as the source language (SL) into Malay as the target language (TL). The proposed prototype maps the SL "meaning" to its most equivalent translation in the TL. In this paper, we investigate the features of Arabic-Malay machine translation (i.e., syntactic, semantic, and morphological); the proposed method aims at building a robust lexical machine translation prototype, named AMMT. The paper reports ongoing research toward building a successful Arabic-Malay MT engine. Human judgment and BLEU evaluation have been used for evaluation purposes. The results of the first experiment indicate that our system (AMMT) outperformed several well-regarded MT systems with an average of 98, while the second experiment shows average 1-gram, 2-gram, and 3-gram scores of 0.90, 0.87, and 0.88, respectively. This result could be considered a contribution to the domain of natural language processing (NLP).


Introduction
Arabic is a natural language spoken natively by hundreds of millions of people; it is also the language of prayer for around 1.4 billion Muslims around the world (Shaalan et al., 2019). Arabic is considered a derivational, subject pronoun-drop language with a Subject-Verb-Object (SVO) structure by default. Malay, on the other hand, is the mother tongue of many people in Southeast Asia (Hamza et al., 2019). Morphologically speaking, Malay words are formed by means of affixes, composition, and reduplication. At the level of syntax, the default Malay sentence structure is Subject-Verb-Object (SVO). In addition, Malay is a verbal grammatical language (i.e., possessive, adj.).
According to Al Saket et al. (2014), Malay has robust lexical features; moreover, it has no inflections at all for its verbs or nouns, and, as noted, its morphology relies on affixes, composition, and reduplication.
According to Almeshrky et al. (2012), researchers should take three types of knowledge into consideration to obtain a proper translation for this language pair.
1. Comprehend the source language (lexicon, morphology, syntax, and semantics) to understand the meaning of the source text.
2. Comprehend the same features (lexicon, morphology, syntax, and semantics) of the target language to produce a better translation.

Related Work
Several researchers have published work in this domain, particularly for this language pair. Abdalla (2012) introduced a rule-based MT system that performs morphological and syntactic analysis of the SL to obtain a syntactic structure, which is then used to build the final representation of the TL using the transfer approach. Almeshrky et al. (2012) discuss ambiguity in Arabic-Malay translation systems. Several challenges need to be taken into consideration when automating translation into Malay:
• Several meanings for the same Arabic word. Consider these examples: 1. ( → kindhearted) may be translated as "baik budi" or "baik hati".

System Design and Architecture
According to Shaalan (2010), transfer-based translation passes through three phases:
1. Analysis process.
2. Transfer process.
3. Generation process.
Initially, the input is analysed to obtain an SL structure that captures its "meaning", from which a proper equivalent translation is generated in the TL.

Analysis Module
The prototype analyzes the input lexically, morphologically, and syntactically:

Lexical Databases
The information or features assigned to individual words are usually defined as lexical resources. In our approach, we have developed a lexicon of Arabic-Malay words and phrases and assigned each word or phrase meaning its features (i.e., number, gender, person, case, humanity, and alive/non-alive).
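A lexicon entry of this kind can be pictured as a small record that carries the Malay equivalent together with the listed features. The following is only a minimal sketch: the class name `LexiconEntry`, the field names, the feature values, and the two sample entries are assumptions made for illustration, not the actual AMMT lexicon format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """One Arabic-Malay lexicon record (hypothetical layout)."""
    arabic: str
    malay: str
    pos: str
    number: str             # "singular", "dual" or "plural"
    gender: str             # "masculine" or "feminine"
    person: Optional[int]   # 1, 2, 3 or None when not applicable
    case: Optional[str]     # None when the case is left unspecified
    human: bool             # the "humanity" feature
    animate: bool           # the "alive/non-alive" feature


# Two illustrative entries; the feature values are assumptions.
LEXICON = {
    "طالب": LexiconEntry("طالب", "pelajar", "noun", "singular",
                         "masculine", 3, None, True, True),
    "مشكلة": LexiconEntry("مشكلة", "masalah", "noun", "singular",
                          "feminine", 3, None, False, False),
}


def lookup(word: str) -> Optional[LexiconEntry]:
    """Return the entry for an Arabic word, or None if it is unknown."""
    return LEXICON.get(word)
```

Keeping the features on the entry itself means that later agreement and reordering steps can read them without consulting a second resource.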

Tokenisation
The word "token" refers to one of the smaller units into which a text is split. Tokenization in our system extracts the clitics, prefixes, and suffixes of each word in the input sentence (Attia, 2007). As shown in Figure 1, the process returns a list of Arabic words. Each word is then analyzed morphologically according to derivational rules (Badaro et al., 2019; Habash, 2008). The derivation algorithm invokes certain features of the input (i.e., verb-adj, sub-noun, etc.), considering number, gender, humanity, alive, etc. (Shquier MMA, 2013, 2019).
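The clitic extraction step can be illustrated with a deliberately naive affix stripper. This is a sketch under stated assumptions, not the tokenizer used in AMMT: the clitic lists, the `min_stem` threshold, and the function names are hypothetical, and a real system would validate each split against the lexicon, because pure string matching over-segments words whose stem happens to begin or end with the same letters.

```python
# Hypothetical clitic inventories, applied in a fixed order:
# conjunction, then preposition, then definite article.
PROCLITIC_STAGES = [
    ["و", "ف"],   # wa-, fa-
    ["ب", "ل"],   # bi-, li-
    ["ال"],       # al-
]
ENCLITICS = ["هما", "هم", "ها", "ه"]   # longest first


def split_clitics(word, min_stem=3):
    """Return (proclitics, stem, enclitics) for a single word."""
    prefixes, suffixes, stem = [], [], word
    for stage in PROCLITIC_STAGES:
        for clitic in stage:
            # Strip only if enough letters remain to form a stem.
            if stem.startswith(clitic) and len(stem) - len(clitic) >= min_stem:
                prefixes.append(clitic)
                stem = stem[len(clitic):]
                break
    for clitic in ENCLITICS:
        if stem.endswith(clitic) and len(stem) - len(clitic) >= min_stem:
            suffixes.append(clitic)
            stem = stem[:-len(clitic)]
            break
    return prefixes, stem, suffixes


def tokenize(sentence):
    """Split on whitespace and flatten every word into its tokens."""
    tokens = []
    for word in sentence.split():
        prefixes, stem, suffixes = split_clitics(word)
        tokens.extend(prefixes + [stem] + suffixes)
    return tokens
```

With these lists, a word meaning "and the book" is split into the conjunction, the article, and the stem, while a word meaning "his book" yields the stem followed by the pronoun suffix.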

Syntactic Analysis
Many researchers consider this process a major component of any MT system. It analyses the SL to determine a reasonable grammatical structure, and this information is then used to split the sentence into smaller units. Once the normalizer and tokenizer have finished their tasks, the parser takes the input and returns a list of the tokens' parts of speech, as shown in Figure 2. The Stanford parser has been used for this purpose.

Transformation Module
The transformation is carried out using two processes: 1. Lexical Transfer.
The transformation is carried out as follows:
1. Calling the Arabic-Malay bilingual dictionary.
2. Calling the parser to get the POS.
http://cis.ccsenet.org Computer and Information Science Vol. 13, No. 4; 2020
The prototype framework is shown as a flow chart in Figure 2.
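The two calls above can be sketched as a lookup keyed by the word and its POS tag. Everything in this sketch is an assumption made for illustration: the dictionary layout, the function name `lexical_transfer`, and the sample entries are hypothetical, the POS tags are supplied by hand instead of coming from a parser, and the tag names are borrowed from the SL rule string quoted later in the paper.

```python
# Hypothetical bilingual dictionary keyed by (Arabic word, POS tag).
BILINGUAL_DICT = {
    ("حل", "VD"): "menyelesaikan",
    ("طالب", "NS"): "pelajar",
    ("مشكلة", "N"): "masalah",
    ("صعبة", "J"): "sukar",
}


def lexical_transfer(tagged_tokens, dictionary=BILINGUAL_DICT):
    """Map each (Arabic word, tag) pair to a (Malay word, tag) pair.

    Words missing from the dictionary are passed through unchanged so
    that later stages can still see them.
    """
    transferred = []
    for word, tag in tagged_tokens:
        malay = dictionary.get((word, tag), word)
        transferred.append((malay, tag))
    return transferred


# The tags would normally come from the parser; here they are given by hand.
sentence = [("حل", "VD"), ("طالب", "NS"), ("مشكلة", "N"), ("صعبة", "J")]
print(lexical_transfer(sentence))
```

Keying on the tag as well as the word leaves room for an Arabic word to receive different Malay equivalents depending on its part of speech.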

Generation Module
In this process, the TL output is rendered in a form that respects the grammar and meaning of the language.
1. Accepts the Malay words to generate a well-formed sentence.
2. Considers agreement and reordering, as shown in Figure 3.
Certain rules are considered during this process:
• Malay ignores the definite article in general.
• Malay dual nouns are translated by adding the word "dua" before the noun.
• Malay nouns are indirectly inflected for gender.
• Malay affixes attached to adjectives are mostly similar to those attached to verbs.
• Malay pronouns depend on the speakers' status.
• Malay possessive pronouns are not attached to nouns.
• Malay classifiers (Penjodoh Bilangan) precede nouns to show their amounts, as follows:
orang (person, people → ) is used for humans.
biji (seed → ) is used for small, round objects such as eggs, sweets and fruits.
batang (stick → ) is used for long, slim items such as pencils, pens, or sticks.
keping (pieces → ) is used for a piece/pieces of paper, bread, cake, cheques, photographs.
pucuk (shoots → ) is used for letters and arms.
• Most verbs are preceded by one or more verbal prefixes (i.e., meng- for active voice, di- for passive voice, and ber- for intransitivity).
• In Malay noun phrases, modifiers generally follow the head but quantifiers usually precede it.
• Malay has no inflections; instead, prepositions are used to indicate syntactic relations.
• Malay has no concatenated pronouns; instead, they are written separately based on number, gender and tense features.
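Two of the rules above, the "dua" rule for dual nouns and the classifier rule, lend themselves to a short illustration. This is a minimal sketch and not the AMMT generator: the category keys and the function name `render_dual` are hypothetical, and placing the classifier between "dua" and the noun is an assumption based on ordinary Malay usage.

```python
# Classifiers taken from the rule list above; the category keys are
# hypothetical labels introduced for this sketch.
CLASSIFIERS = {
    "human": "orang",
    "small_round": "biji",
    "long_slim": "batang",
    "flat_piece": "keping",
    "letter_or_arms": "pucuk",
}


def render_dual(noun, category=None):
    """Render a dual noun as 'dua [classifier] noun'."""
    if category is None:
        return f"dua {noun}"
    return f"dua {CLASSIFIERS[category]} {noun}"


print(render_dual("pelajar"))            # dua pelajar
print(render_dual("pelajar", "human"))   # dua orang pelajar
```

In a fuller system the category would be read from the lexicon features (for example, the humanity feature would select "orang").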
Let us take an example of how the system handles SL-TL word ordering based on the rules mentioned earlier. For a given SL input, the associated Arabic sentence matches the rule VD/1;NS/2;N/3;J/4; the corresponding Malay mapping structure would then be NNS/1;VBD/2;NNS/3;JJ/4, and hence the equivalent TL translation would be pelajar/NNX menyelesaikan/VBX masalah/NNX yang sukar/JJ. The flow of the agreement and ordering process is shown in Figure 3. It is worth stressing that we have built 183 structures to map each SL sequence structure to its corresponding Malay structure; a sample of these mappings is shown in the corresponding table.

Implementation and Design
We have exhibited the entire process of our prototype in Figure 4; the developed design utilizes a framework developed by Hamdy N. Agiza (2012). A full representation of the prototype is shown in Figure 4 and Figure 5, respectively.
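To make the structure mapping from the Generation Module concrete, the reordering example can be sketched as a table lookup followed by a permutation. This is an illustrative sketch under assumptions: the rule encoding (a list of TL tags paired with zero-based SL positions), the function name `reorder`, and the blanket insertion of "yang" before every adjective are hypothetical choices inferred from the single example in the text, not the actual format of the 183 structures.

```python
# One hypothetical rule: SL tag pattern -> list of (TL tag, SL position).
# The subject (SL position 1) moves in front of the verb (SL position 0).
STRUCTURE_RULES = {
    ("VD", "NS", "N", "J"): [("NNS", 1), ("VBD", 0), ("NNS", 2), ("JJ", 3)],
}


def reorder(tagged_malay):
    """Reorder (Malay word, SL tag) pairs into TL order.

    If no rule matches the SL tag pattern, the SL order is kept.
    """
    pattern = tuple(tag for _, tag in tagged_malay)
    rule = STRUCTURE_RULES.get(pattern)
    if rule is None:
        return list(tagged_malay)
    output = []
    for tl_tag, sl_index in rule:
        word = tagged_malay[sl_index][0]
        if tl_tag == "JJ":
            # Assumption: the adjective is linked to its noun with "yang".
            word = "yang " + word
        output.append((word, tl_tag))
    return output


sl_order = [("menyelesaikan", "VD"), ("pelajar", "NS"),
            ("masalah", "N"), ("sukar", "J")]
print(" ".join(word for word, _ in reorder(sl_order)))
# pelajar menyelesaikan masalah yang sukar
```

Storing the rules in a dictionary keyed by the SL pattern keeps lookup constant-time, which matters little for 183 structures but keeps the matching logic trivial.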

Experiment and Results
To judge the translation accuracy achieved by AMMT, we tested our approach against human translation as follows:
1. Test the prototype against the selected test examples.
2. Compare the output with the human translation.
3. Assign the reason behind each ill-translation to its corresponding category.
4. Assign a score (0-10) for each problem.

Experiment
Three well-regarded MT systems (i.e., Microsoft, Google, and Yandex) are compared against our proposed system to evaluate the performance of AMMT. In the first experiment, a human judgment methodology is used for this purpose, while in the second experiment we evaluate our system with the iBLEU metric (Papineni et al., 2002).

Human Judgment
We compared the output of our proposed system against the human translation. We built a test suite of 130 examples that were carefully selected from scientific books and popular media channels; the result is shown in the table and Figure 6. To judge the evaluation properly, we constructed a matrix that relates each translation issue to a certain score according to the following criteria:
1. Def-Noun: This problem arises when the system fails to distinguish between the articles "a(n)" and "the". 9.
5. Addition and Deletion: 7.
In the table (i.e., for the Def-Noun agreement), we notice that this particular issue appeared 4 times in Microsoft, 4 times in Google, 4 times in Yandex, and twice in AMMT. Hence, the Def-Noun agreement issue arose 14 times across all systems under evaluation. Figure 6 and Figure 7 represent the frequencies of these issues after conducting the human judgment experiment. The experiment shows that our system outperformed the other systems with an average of 98.0; statistically speaking, only 2% of the test examples showed errors during the human judgment experiment.

The BLEU Evaluation
The BLEU metric ranges between 0 and 1. Few translations score 1 unless they are identical to a reference translation; for this reason, even a human translator may not score 1. It is worth stressing that a higher score generally requires more reference translations per sentence. In this experiment, we compute the iBLEU scores (1-gram, 2-gram, and 3-gram) for all test suite sentences. Afterward, we compute the overall average of each n-gram iBLEU score. The table presents the iBLEU scores of Yandex against AMMT for 1-gram, 2-gram, and 3-gram; the table and Figure 8 show the iBLEU scores of the two systems against two references on the test suite mentioned above.

Conclusion
In this study, we developed a lexical MT system using a scalable transfer-based architecture for the translation of MSA into Latin-based Malay. The deliverables of this study are: first, the development of Arabic-Malay transformation structures; second, the development of an MT prototype based on the transfer approach; third, shedding light on the challenges of Arabic-to-Malay MT systems and proposing methods for handling them; and fourth, the development of test examples. These examples have been used in evaluating AMMT against Microsoft, Google, and Yandex by means of human judgment and the iBLEU metric. The two experiments indicate that AMMT outperformed the other systems.
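As a supplementary illustration of the per-n-gram scores used in the BLEU evaluation, the clipped (modified) n-gram precision that underlies BLEU can be computed as below. This is a simplified sketch and not the iBLEU tool used in the experiments: it omits the brevity penalty and the geometric mean over n-gram orders, tokenizes on whitespace only, and the function names and sample sentences are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    # For every n-gram, keep the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


references = ["pelajar menyelesaikan masalah yang sukar"]
candidate = "pelajar menyelesaikan masalah sukar"
for n in (1, 2, 3):
    print(n, modified_precision(candidate, references, n))
```

Averaging these values over all test sentences, separately for each n, gives per-order figures comparable in spirit to the 1-gram, 2-gram, and 3-gram averages reported here, although the exact numbers depend on the tool and its tokenization.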