Neural Machine Translation: Fine-Grained Evaluation of Google Translate Output for English-to-Arabic Translation

The neural machine translation (NMT) revolution is upon us. Since 2016, an increasing number of scientific publications have examined the improvements in the quality of machine translation (MT) systems. However, much remains to be done for specific language pairs, such as Arabic and English. This raises the question of whether NMT is a useful tool for translating text from English to Arabic. To address this question, 100 English passages were obtained from different broadcasting websites and translated using NMT in Google Translate. The NMT outputs were reviewed by three professional bilingual evaluators specializing in linguistics and translation, who scored the translations based on the translation quality assessment (QA) model. First, the evaluators identified the most common errors that appeared in the translated text. Next, they evaluated the adequacy and fluency of the MT output using a 5-point scale. Our results indicate that mistranslation is the most common type of error, followed by corruption of the overall meaning of the sentence and orthographic errors. Nevertheless, the adequacy and fluency of the translated text are of acceptable quality. The results of our research can be used to improve the quality of Google NMT output.


Introduction
In recent years, the translation process has substantially changed because of technological advancements, such as the use of the Internet and the availability of web-based machine translation (MT) systems (Johnson et al., 2017). MT is an approach to translating texts from one language to another. For a long time, MT had a poor reputation because its output was perceived to be of low quality (e.g., Agarwal et al., 2011). However, recent research has found that the quality of output has improved enough to be used in the translation industry (e.g., Chen, Acosta, & Barry, 2016).
MT has been developed since the 1950s, and different theories and practices have emerged over time. Recently, the quality of neural machine translation (NMT) has been the primary concern of researchers. NMT has emerged as an innovative translation approach that uses deep learning for translation of text in foreign languages (Wu et al., 2016).
Depraetere (2011) describes four techniques for evaluating MT: (1) human evaluation of adequacy and fluency, (2) automated evaluation techniques, (3) evaluation based on the analysis of errors, and (4) evaluation based on post-editing time. The present study uses the first and third techniques.
In the MT field, MT can be assessed manually or automatically (Lommel et al., 2014). Although automatic evaluation is objective and cheap, it is less comprehensive than human evaluation (ibid). According to Maučec and Donaj (2019), human evaluation is the most common option for assessing the quality of MT. Hence, automatic evaluation was not adopted in this study.
In human evaluation, the MT output is assessed by expert evaluators, proficient in translation, who should be bilingual in both the source and target languages (ibid). According to Bonnie et al. (2010, p. 809): "The fact is we have no real substitute for human judgments of translations. Such judgments constitute the reference notion of translation quality." Human evaluation can play a crucial role in improving MT technology; hence, research in MT is now moving toward integrating human quality assessment (QA) into the MT field (Girardi, 2014).
In terms of error analysis, this field of study is a part of applied linguistics. It aims to detect problems in translation and reveal the degrees and patterns of errors (Kafipour & Jahanshahi, 2015). In translation, identifying errors is crucial, especially for improving the quality of the end-product (van der Wees, Bisazza, & Monz, 2015). Hence, the present study adopts this method of analysis, wherein we apply (1) human evaluation of adequacy and fluency and (2) human error analysis. In the latter, human evaluators identify and classify translation errors and precisely describe specific deficiencies in the MT output.
In this paper, we provide a detailed overview of the types of errors in GNMT and identify its potential shortcomings using (1) human evaluation of adequacy and fluency and (2) human error analysis methods. Unlike previous works (e.g., Burchardt et al., 2017; Isabelle, Cherry, & Foster, 2017; Oudah et al., 2019), where NMT was compared with PBMT, we take a different approach by examining the possible deficits in the Google Translate system (as the most widely used free engine for translation).
To the best of our knowledge, no previous study has examined the Google Translate output for the English-Arabic language pair using the same methodology (specifically, adopting both (1) human evaluation of adequacy and fluency and (2) error analysis using taxonomies) after Google updated its system in 2017.

Google Translate: Statistical and Neural Machine Translation (NMT)
Google Translate was developed in 2006 and launched as the best statistical MT (Och, 2009). Translation with Google Translate is performed by a computer system and is based on text patterns rather than on specific language rules as a reference.
In 2016, Google Translate received a significant update: it was improved by adopting NMT over statistical MT (United Language Group, 2017). NMT is an innovative method of MT, which creates more accurate translations than statistical MT (Turovsky, 2018). Specifically, NMT uses a neural network (as in the human brain), wherein information is sent to various layers and is processed before the output (Cheng, 2019). NMT mainly focuses on the use of deep learning methods for translating text based on the already developed statistical models (ibid). Moreover, using deep learning techniques allows for faster translations than using statistical models alone. This enhances the ability of NMT to provide a higher-quality output during the translation (Cheng, 2019). Moreover, NMT uses algorithms to provide a better understanding of linguistic rules from the statistical models. One benefit of using NMT is its quality and speed (Cheng, 2019). Thus, NMT is believed to be an essential translation method of the future, and the translation capabilities with the use of NMT will continue to advance. NMT focuses on the translation of a whole sentence at a time (Turovsky, 2018). The current Google Translate has been estimated to be 60 times more accurate than the previous translation system (ibid).
For example, Popović (2018) examined the overall performance of NMT and PBMT for the German-English language pair. She manually annotated 264 sentences for English-to-German and 204 sentences for German-to-English obtained from a corpus of 3000 sentences. She found that the number of correct sentences in NMT was remarkably higher than in PBMT. She concluded that NMT outperformed PBMT in terms of verb aspects (form, order, and omission), articles, English noun collocations, and German compounds, as well as phrase structure, which improves fluency. Many other studies have compared the output of NMT and PBMT for many language pairs, including the Arabic language (e.g., Burchardt et al., 2017; Isabelle, Cherry, & Foster, 2017; Oudah et al., 2019). Therefore, we will not be comparing PBMT and NMT in this paper.
Support for further languages was added in March 2017: Russian, Hindi, Vietnamese, Thai, Hebrew, and Arabic (Jordan, 2017).

Related Works
MT can be evaluated by presenting the output of MT to bilingual human evaluators, who understand both source and target languages, to score the quality of a translation (Popovic, 2018). Human evaluators can adopt two different approaches. First, experts can evaluate adequacy (i.e., preservation of meaning) and fluency (i.e., grammaticality and overall quality; based on a combination of both), as well as estimated cognitive post-editing effort. Second, the experts can compare different MTs of the same source text to identify which translation is better without providing any scores (Callison-Burch et al., 2007). Ghasemi and Hashemian (2016) examined the quality of Google Translate's output for English-Persian and Persian-English translations using MT QA. The study focused on translating 100 selected sentences from Motarjem Harma, an interpreter application. The effectiveness of Google Translate was analyzed based on errors generated by two MT QA systems (Ghasemi & Hashemian, 2016). MT QA was used to analyze the translations using tables for different concepts: wrong word order, errors in the distribution and use of verbs, lexicosemantic errors, and wrong use of tenses. From the results obtained, Ghasemi and Hashemian (2016) found no significant differences between the two systems when translating from English to Persian and from Persian to English. Moreover, the analysis could not identify the error frequency in all types of texts translated by Google Translate.
Several studies have focused on error analysis and classification in the area of MT. Many researchers, such as Llitjós et al. (2005); Vilar et al. (2006); Bojar (2011), focused on design of taxonomies. For example, one of the most referred taxonomies in MT is the classification proposed by Vilar et al. (2006). They extended the work of Llitjós et al. (2005) and classified errors into five categories: "Missing Words," when some words in the translated text (TT) are missing; "Word Order," errors related to word order in the target sentence; "Incorrect Words," errors that occur when the system does not provide the correct translation of a given word; "Unknown Words," words found when the system copies the input word to the TT without changing it; and finally "Punctuation Errors." Similarly, Vilar et al. (2006) and Bojar (2011) classified errors into four types: "Bad Punctuation," "Missing Word," "Word Order," and "Incorrect Words." Many other studies, such as Popović and Ney (2006), evaluated error identification. In this paper, we examine a linguistically motivated taxonomy for translation errors that extends the previous ones. Our research is different in two ways: first, we provide a detailed examination and analysis of errors in MT output (specifically, for Google Translate) and, second, we examine the quality of MT output in terms of adequacy and fluency.
A related study conducted by Zaghouani (2016) presented guidelines and annotation procedures to create a human-corrected MT corpus for the Modern Standard Arabic. Zaghouani created comprehensive and simplified annotation guidelines with the help of a team of five annotators and one lead annotator. To ensure a high annotation agreement between the annotators, Zaghouani organized several training sessions for the annotators. It was the first published manual post-editing annotation of MT for the English-Arabic language pair (ibid).
Zaghouani created general annotation correction guidelines and classified errors under seven categories: spelling errors (which mostly occur in letters Yaa and Hamza), word choice errors, morphology errors (the use of incorrect inflection or derivation), syntactic errors (gender and number agreement, definiteness, wrong case, and tense assignment), proper name errors (when the names of entities are improperly translated into Arabic), dialectal usage errors (when the dialect is generally not present in the MT texts), and punctuation errors (in some cases, punctuation signs appear in the wrong place).
Bojar (2011) manually identified errors to evaluate four systems: Google Translate, PC Translator, TectoMT, and CU-Bojar (Bojar et al., 2009). He applied two techniques of manual evaluation to identify the error types produced by these MT systems. The first technique is "blind post-editing," where the evaluation was performed by two evaluators separately. The first evaluator edited the system output and, thus, produced an edited version. The second evaluator worked on the edited version, compared the source and the reference translation, and judged whether the translation was still acceptable. The second technique was the manual annotation of the errors using a taxonomy inspired by Vilar et al. (2006). Condon et al. (2010) examined MT from English to Iraqi Arabic and vice versa. They classified errors under "Deletions," "Insertions," and "Substitutions" for morphological classes and types of errors, following a taxonomy similar to that proposed by Vilar et al. (2006).
No general rules for defining error categories exist (Popovic, 2018). In this paper, we classify errors using a similar approach as previous researchers. However, we use a slightly different taxonomy as the Arabic language

Translation Quality Assessment (TQA)
Translation QA (TQA) is the process of assessing a translated text in terms of its quality (Munday, 2001). To ensure a valid and reliable assessment, it has to follow particular rules and standards (Williams, 2009). However, the process of determining particular criteria for evaluating translation quality is a difficult task, which is believed to be "probably one of the most controversial intensely debated topics in translation scholarship and practice" (Colina, 2009, p. 236). That is because the assessment criteria are negotiable in the field of translation studies, as the relative nature of quality itself is believed to be too complex and too context dependent to be formulated under one definition (Nord, 1991). However, many researchers agree that assessing translation quality should measure particular issues, such as adequacy and fluency; these two metrics are most commonly used in human evaluation (White, 1994; Callison-Burch, 2007). For example, Gupta et al. (2011) assert that human evaluation is based on adequacy and fluency.
Adequacy (also called accuracy or fidelity) is defined as the extent to which the translation conveys the meaning of the source language unit (Koehn, 2009). Fluency is defined as the extent to which the translation follows the rules and the norms of the target language; thus, it focuses only on the target language unit (Castilho et al., 2018). Importantly, this aspect of evaluating the MT output is normally conducted at the sentence or segment level without considering the context of the translation (ibid).
An error analysis method aims at analyzing errors to obtain an error profile for a translation output (Popovic, 2018). It can be conducted manually, automatically, or semi-automatically (combined method) (Popovic, 2018). The most obvious method for error analysis is to examine the translation output, mark each error in the translation, and assign a corresponding error tag to it (Guzmán et al., 2015). Error classification aims to identify and classify actual errors in a translated text.

Materials and Methods
Although human evaluation is expensive and time consuming, it is more accurate and can provide a more thorough analysis of the errors (Joshi et al., 2015) and can be performed by one or multiple evaluators. In the case of multiple human evaluators, the agreement among them can be calculated to provide additional information on the reliability of the results (ibid).
Data for this study were collected from English articles. We manually examined the samples for readability, potential translation problems, and MT quality. To identify the problems in the output of MT, a deep linguistic error analysis was conducted for a sample of English passages translated into Arabic by GNMT.
A total of 100 English passages were obtained from English articles and were translated into Arabic using Google Translate. The source and target passages were directly compared one by one by human evaluators, who used numerical ranges for judging the quality of the MT output. Specifically, the evaluators used the error analysis method and additionally evaluated adequacy and fluency when examining the GNMT output.
The general process of manual error classification is illustrated in Figure 1. Figure 1 presents error taxonomies that cover both translation aspects and linguistic aspects. Three evaluators, who are experts in linguistics and translation studies for the Arabic-English language pair, conducted a detailed analysis at the translational and linguistic levels and examined the MT translations for adequacy and fluency. Error analysis at the translation level includes the following seven types of errors. (1) Mistranslation errors (abbreviated in the figure as Mis.) comprise all errors related to incorrect translations of the source language content. (2) "Untranslated" errors (untrans.) occur when the source language content is not translated. (3) "Addition" errors (add.) occur when elements that are not present in the source text are added to the target text. (4) Omission errors (omit.) occur when elements that are present in the source text are deleted from the target text. (5) Lexical errors (lexis.) include word choice errors. (6) Orthographic errors (ortho.) include spelling and punctuation errors, where in some cases punctuation signs appear in wrong places. (7) Miscellaneous errors include errors that do not fall under any of the other categories, such as names of entities or concepts that are improperly translated into Arabic.
At the linguistic level, errors were categorized into three levels: syntactic errors, grammatical errors, and semantic errors. Syntactic errors were subcategorized into errors that occur when the translation starts with a nominal sentence in the place of a verbal sentence in the ST (Nomi. sen. instead of v. sent.) and when the TT violates the entire phrase structure (viol. structure) (e.g., putting adjective before noun). Grammatical errors include violating subject-verb agreement (viol. S-V agree), such as masculine and feminine; singular, dual, and plural; first, second, and third person.
Conversely, fluency can be evaluated by examining the target segments only; the goal is to examine the language quality of the translated text. The fluency score is defined as follows: "1" is given for incomprehensible target language, "2" is given for a disfluent target language, "3" is given for non-native kind of target language, "4" is given for a good-quality target language, and "5" is given for flawless target language (White et al., 1994). Evaluators rated the MT output using the predetermined scale described above. The scale ranges from 1 to 5, where 1 is the lowest score and 5 is the highest score.
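The 5-point fluency scale described above can be written down as a small lookup table. The following is only an illustrative sketch: the label strings paraphrase the scale as given in the text, while the function names and the sample ratings are hypothetical and are not taken from the study's data.

```python
from statistics import mean

# Fluency labels as listed in the text (scale attributed to White et al., 1994).
FLUENCY_SCALE = {
    1: "incomprehensible",
    2: "disfluent",
    3: "non-native",
    4: "good quality",
    5: "flawless",
}


def describe_fluency(score):
    """Return the label for a 1-5 fluency score, rejecting anything else."""
    if score not in FLUENCY_SCALE:
        raise ValueError(f"fluency score must be an integer from 1 to 5, got {score!r}")
    return FLUENCY_SCALE[score]


def mean_fluency(scores):
    """Average a list of fluency scores after validating each one."""
    for score in scores:
        describe_fluency(score)
    return mean(scores)


# Hypothetical ratings given by one evaluator to four passages.
sample_ratings = [4, 3, 5, 4]
print(describe_fluency(3), mean_fluency(sample_ratings))
```

Validating each score before averaging keeps out-of-range ratings from silently distorting an evaluator's mean.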
There are three common inter-rater agreement metrics for the evaluation: the percentage of agreement, various versions of Cohen's kappa measure, and the intra-class correlation coefficient (Graham et al., 2012). The percentage of agreement is the simplest and the most straightforward measure. It provides basic approximation of the evaluators' agreement. Cohen's kappa measure is more rigorous than the percentage of absolute agreement because it considers the evaluators' agreement by chance. Typically, kappa measures the agreement between two raters. The intra-class correlation measures the agreement among evaluators when there are many rating categories (5 or more) or when ratings are made along a continuous scale (ibid).
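To make the difference between the first two metrics concrete, the sketch below computes the percentage of agreement and Cohen's kappa for two raters. It uses the standard kappa definition, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance; the function names and the binary "error / no error" judgments are hypothetical and do not come from the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items (0.0-1.0) on which two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: agreement corrected for chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    # Chance agreement: product of the two raters' marginal label proportions.
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical judgments for ten sentences (1 = error present, 0 = no error).
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(percent_agreement(rater_1, rater_2))
print(cohens_kappa(rater_1, rater_2))
```

In this made-up example the raters agree on 8 of 10 items, but because half of that agreement would be expected by chance, kappa is noticeably lower than the raw percentage.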
Evaluators met multiple times to identify taxonomies and classify the data under different categories and taxonomies. Before evaluating the dataset, evaluators agreed on 19 taxonomies to classify errors in the MT output. Because many errors were identified by one evaluator but not by the others, evaluators had to agree on particular errors to be considered in this analysis. As we have multiple well-defined labels and standards that each evaluator agreed on and clearly understands, the percentage of absolute agreement is used, which simply calculates the number of times evaluators agree on a rating. Importantly, evaluators underwent training to develop a common understanding of how to apply the rating system as consistently as possible. Previous research shows that such training improves accuracy, reliability, and validity (Woehr & Huffcutt, 1994; Gorman & Rentsch, 2009; etc.).
Subsequently, the inter-evaluator agreement was calculated for each label separately based on the evaluators' decisions at the meeting. The same approach was used for the adequacy and fluency measures. Their agreement was calculated for each label, and the average scores are presented in Table 2.
To judge whether an inter-rater agreement is sufficient or not, various experts (e.g., Hartmann, 1977;Stemler, 2004) contend that when using the percentage of absolute agreement, values from 75% to 90% demonstrate an acceptable level of agreement.
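As a rough illustration of the percentage of absolute agreement with three evaluators, the sketch below counts an item as agreed only when all evaluators give it the same rating and then compares the result with the 75% lower bound mentioned above. The function names and the ratings are hypothetical; the study's own figures were derived from the evaluators' meeting decisions rather than from code like this.

```python
def absolute_agreement(ratings_by_evaluator):
    """Percentage of items on which every evaluator gave the same rating."""
    if not ratings_by_evaluator:
        raise ValueError("at least one evaluator is required")
    lengths = {len(ratings) for ratings in ratings_by_evaluator}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all evaluators must rate the same non-empty set of items")
    items = list(zip(*ratings_by_evaluator))
    unanimous = sum(1 for item in items if len(set(item)) == 1)
    return 100.0 * unanimous / len(items)


def is_acceptable(percentage, lower_bound=75.0):
    """Check the agreement against the 75% lower bound cited in the text."""
    return percentage >= lower_bound


# Hypothetical ratings for eight sentences (1 = error present, 0 = no error).
evaluator_1 = [1, 0, 1, 1, 0, 0, 1, 0]
evaluator_2 = [1, 0, 1, 1, 0, 1, 1, 0]
evaluator_3 = [1, 0, 1, 1, 0, 0, 1, 0]

agreement = absolute_agreement([evaluator_1, evaluator_2, evaluator_3])
print(agreement, is_acceptable(agreement))
```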

Results and Discussion
All of the evaluators identified and classified the errors at the sentence level in 100 passages translated by MT. Evaluators' agreement was first compared in terms of error localization to ensure that all evaluators agree whether there is an error in the sentence or not. Then, we took all agreed errors for all 19 classifications and added them to a separate column for better visualization of the results. In other words, an error must be agreed upon by all evaluators to be counted as an error. Once the data were evaluated, the inter-rater agreement was calculated using the percentage of absolute agreement.
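The rule that only errors accepted by all evaluators are counted can be approximated as a set intersection. This is a simplification made for illustration: in the study, agreement was reached through discussion at meetings, and the evaluator names, sentence identifiers, and category labels below are hypothetical.

```python
from collections import Counter


def agreed_errors(annotations):
    """Keep only the (sentence_id, category) pairs marked by every evaluator."""
    error_sets = list(annotations.values())
    if not error_sets:
        return set()
    return set.intersection(*error_sets)


def count_by_category(errors):
    """Tally the agreed errors per category."""
    return Counter(category for _, category in errors)


# Hypothetical annotations: each evaluator marks (sentence_id, category) pairs.
annotations = {
    "evaluator_1": {(1, "mistranslation"), (2, "orthographic"), (3, "lexical")},
    "evaluator_2": {(1, "mistranslation"), (2, "orthographic"), (4, "omission")},
    "evaluator_3": {(1, "mistranslation"), (2, "orthographic"), (3, "lexical")},
}

consensus = agreed_errors(annotations)
print(sorted(consensus))
print(count_by_category(consensus))
```

Note that the lexical error in sentence 3 is dropped because only two of the three hypothetical evaluators marked it.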

Error Taxonomies
As shown in Figure 2, evaluator 1 identified mistranslation errors as the most common in the MT output. The second most common type was "corrupting the overall meaning of the sentence," followed by "lexical errors." The category "omitting necessary words" had zero errors and thus the lowest percentage of all categories. Figure 2 also shows that the least frequent errors were related to using definite articles before genitives, using unfamiliar words in place of collocations, using terms that convey a very different meaning, and using a noun in place of a verb.
Figure 3 shows the results for evaluator 2. The numbers suggest that mistranslation errors were the most common, followed by corrupting the overall meaning of the sentence and orthography. Moreover, using the definite article before genitives, wrong references, and omitting necessary words had the lowest percentages of all categories.
Figure 4 shows the results for evaluator 3. Here, mistranslation errors were the most common, followed by orthographic errors and corrupting the meaning of the sentence. In contrast, errors resulting from using definite articles before genitives and omitting necessary words or phrases had the lowest frequency.
To illustrate how evaluators identified and classified errors in the data, this section gives one example for each of the 19 classifications.
The expression "sexual abuse" has been mistranslated by MT. It could not differentiate between the expressions "sexual abuse" and "sexual assault," as they have been translated identically. However, there is a difference between the words "abuse" and "assault." The correct translation maintains this difference.

English (EN):
The US president also sparred at the White House with a Reuters correspondent, who asked him what he considered treasonous.

3) Example of an addition error
English (EN): Starting early Wednesday, crowds gathered in a half-dozen neighborhoods across Baghdad, with riot police attempting to disperse them using tear gas and firing live rounds into the air.
In the above example, MT has added the word حي to the TT, which does not exist in the ST. Moreover, mistranslation errors occur in this example.

Linguistic Level
At the linguistic level, the evaluators considered fluency errors, which affected the quality of writing in the target language. These included lexical errors, orthographic errors, and miscellaneous errors, that is, errors that do not fall under any of the other categories. Moreover, all grammatical, syntactic, and semantic errors were identified in the MT output.
As shown in Table 3, the three evaluators agreed on 17 lexical errors, 26 orthographic errors, and 7 miscellaneous errors. Examples of these errors are listed below.
MT has translated the word "fight" as معركة. However, the TT word is used inappropriately in this context, as the word معركة refers to a fight in a battle. Such inappropriate usages of words are identified as errors in the MT output.

English (EN):
A parade celebrating the formal ascension of Japan's Emperor Naruhito has been postponed in the wake of Typhoon Hagibis. The parade, which sees the emperor travel in an open-top car to "meet" the public, was postponed out of respect for the victims and their families.
Orthographic errors include punctuation, capitalization, and spelling errors. In this example, the underlined sentence is nonessential information that is added parenthetically to a sentence; it is separated from the main sentence by commas before and after the sentence. The MT replicated the same punctuation system of the English language in the TT. However, the Arabic language does not have a parenthetical phrase or sentence; thus, commas are used wrongly in this situation. Moreover, the Arabic language does not have capitalization; hence, this category is discarded from the analysis.

3) Example of miscellaneous error
English (EN): Over a hundred demonstrators were arrested at yellow vest protests in Paris on Saturday as about 7,500 police were deployed to deal with the movement's radical anarchist "black blocs" strand.
Miscellaneous errors cover different types of errors, such as the translation of the word "anarchist," which refers to a person who rebels against authority. This word has been translated as أناركية using the transliteration strategy. However, as the word "anarchist" has a direct equivalent in the Arabic language, اللاسلطوية, one could argue that the use of the transliteration strategy is not the best option.
Similarly, evaluators agreed on four errors of starting with a nominal sentence in place of a verbal one and agreed on five errors related to violating the entire phrase structure. Table 4 demonstrates the numbers of identified errors separately for each evaluator and the agreed number of errors.
In English grammar, the sentence should always start with a subject. However, this is not the case in Arabic grammar. In this example, the Arabic translation of the sentence followed the same word order (subject or noun + verb) of the English structure instead of following the Arabic grammar (i.e., using a verbal sentence).

5) Example of an error related to violating the entire phrase structure
English (EN): The wave of arrests comes ahead of a "million-man march" Friday called for by an exiled businessman whose online videos accusing Sisi and the military of corruption sparked last week's rallies.
The MT sentence is not comprehensible because of a problem in its structure: the verb is omitted from the Arabic translation, which makes the sentence difficult to understand.
In terms of grammar, as demonstrated in Table 5, evaluator 1 identified four errors related to violating the subject-verb agreement, evaluator 2 identified six, and evaluator 3 identified one. However, at the meeting, evaluators agreed on three errors only. The same applies to the remaining categories: they agreed on 3 errors related to using a noun in place of a verb, 3 errors with using a verb in place of a noun, 10 errors with using wrong prepositions or articles, and 1 error with using the definite article before genitives.
Table 5. Number of errors identified by each evaluator and number of errors agreed on by the evaluators at the grammatical level
Grammatical errors | Number of agreed errors | EVAL 1 | EVAL 2 | EVAL 3
Violating the subject-verb agreement (masculine and feminine; singular, dual, and plural; first, second, and third person) | 3 | 4 | 6 | 1
Using a noun in place of a verb | 3 | 4 | 7 | 1
Using a verb in place of a noun | 3 | 1 | 4 | 4
Using wrong prepositions, articles, and particles | 10 | 5 | 8 | 12
Using definite articles before genitives | 1 | 1 | 0 | 0

English (EN):
The leaders said resolving the conflict is the only way to ensure peace in the region, urging the international community to take action to put a stop to the building and expansion of illegal settlements.
In Arabic grammar, the subject should agree with the verb in terms of gender and number. The word حث is a singular verb that does not agree with its subject "leaders." The plural suffix "واو" and "ا" should be added to the verb حث to agree with its subject.
ijel.ccsenet.org International Journal of English Linguistics Vol. 10, No. 4; 2020

English (EN):
We need to get back to have frank and demanding discussions on Iran's nuclear, regional and ballistic activities but also to have a broader approach than sanctions.
The Arabic translation is not clear, as the sentence starts directly with a noun without prior information about it. In this particular situation, a verb should be added to the Arabic sentence to clarify the meaning.

8) Example of an error related to using a verb in place of a noun
English (EN): Protesters-many of them high school and university students-jumped turnstiles, attacked several underground stations, started fires and blocked traffic, leaving widespread damage across the city and thousands of commuters without transport.

The word "leaving" is translated as a singular verb ترك, which distorts the meaning of the Arabic sentence, as it does not have any clear subject.

9) Example of an error related to using wrong prepositions, articles, and particles
English (EN): The trade agreement did not mention car tariffs of up to 25%, which were previously threatened by the US.
The adverb "previously" has been translated as من قبل, and this is not the correct translation of this adverb, especially as both "previously" and "by" were translated identically as من قبل in the same sentence.

English (EN):
The Australian town of Kingaroy in Queensland was hit by a fierce dust storm on Thursday, with winds reaching up to 90km/h (56 mph).
The indefinite English article "a" is translated as a definite article in Arabic using the prefix "ال". This translation corrupts the structure and the meaning of the Arabic translation.
Finally, Table 6 shows the evaluators' agreement at the semantic level. They agreed on 11 errors with using ambiguous words, 1 error with using terms of different meaning, 1 error with incorrect collocations, 9 errors with using a wrong reference, 4 errors with adding unnecessary words, 0 errors with omitting necessary words, and 20 errors with corrupting the meaning of the entire sentence.
The term "Brexit" refers to the withdrawal of the UK from the European Union. In the Arabic translation, the word has been transliterated without any explanation.
The word "divorce" has been translated into Arabic literally as "ending a marriage." However, the word انفصال, which means "separation," is more appropriate in this context.
Although the words "produce" and "documents" collocate with each other in the English language, they do not collocate in Arabic. Therefore, a better collocation should be used in Arabic to achieve idiomaticity.
14) Example of an error related to using wrong reference and relative pronouns
English (EN): The missile-which was able to carry a nuclear weapon-was the North's 11th test this year. But this one, fired from a platform at sea, was capable of being launched from a submarine.
The demonstrative "this" refers to "missile" in the English sentence. However, in the Arabic translation, "this" refers to the number one instead of "missile."
15) Example of an error related to adding an unnecessary word, preposition, or article before a word
The phrase "Washington swamp" is a metaphor used by politicians in the US to refer to corruption. This phrase has been translated literally, producing a meaningless phrase in Arabic.
In conclusion, the above tables show the number of errors by each evaluator and present the errors that evaluators agreed on. For example, in terms of mistranslation, evaluator 1 identified 35 errors in the MT output, evaluator 2 identified 40 errors, and evaluator 3 identified 46 errors. After the three evaluators discussed and shared their evaluation of the data, they agreed on 40 errors, as shown in Table 2 (first column).
Different types of errors received different amounts of agreement. For instance, the evaluators reported that orthographic errors can be detected easily, as it was easy for them to identify the location of the error; as a result, they easily agreed on 26 errors. However, the evaluators showed little agreement on the category "adding an unnecessary word, preposition, or article before a word," as they found it hard to decide which words are unnecessary. In this case, evaluator 1 identified 10 errors, evaluator 2 identified 4 errors, and evaluator 3 identified 0 errors. After discussion, they agreed on only four errors in the MT output.

QA of Adequacy and Fluency
Using the same dataset, we calculated the evaluators' QA of adequacy and fluency on a 5-point scale. The overall statistics in Table 2 show the average adequacy and fluency scores given by each of the three evaluators in their evaluation of the Google Translate output from English to Arabic.
Evaluation of adequacy in the translation from English to Arabic showed excellent consistency, as the three evaluators provided scores of 69%, 70%, and 71%, with an average score of approximately 70%. Similarly, the three evaluators exhibited a reasonable amount of consistency in terms of the evaluation of fluency, as they provided scores of 70%, 76%, and 84%, with an average score of approximately 77%, as shown in Table 10.
According to our results, we conclude that the most dominant errors in the MT output were mistranslation errors, followed by corruption of the overall meaning of the sentence and then orthographic errors. In terms of the QA of adequacy and fluency, the results were 70% for adequacy and 77% for fluency. Therefore, according to these results for English-to-Arabic translation, Google Translate produces sentences with relatively few errors, and the translated text is fluent to some extent.
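The reported averages can be checked with a few lines of arithmetic. The sketch below assumes that the overall figures are simple arithmetic means of the three per-evaluator percentages quoted above; the text does not state which evaluator gave which score, so the values are kept as unordered lists, and the variable names are illustrative only.

```python
from statistics import mean

# Per-evaluator scores (in percent) as quoted in the text.
adequacy_scores = [69, 70, 71]
fluency_scores = [70, 76, 84]

adequacy_mean = mean(adequacy_scores)
fluency_mean = mean(fluency_scores)

# Range (max - min) as a rough indicator of how closely the evaluators agree.
adequacy_spread = max(adequacy_scores) - min(adequacy_scores)
fluency_spread = max(fluency_scores) - min(fluency_scores)

print(f"adequacy: mean {adequacy_mean:.1f}%, spread {adequacy_spread} points")
print(f"fluency:  mean {fluency_mean:.1f}%, spread {fluency_spread} points")
```

The much narrower spread for adequacy (2 points against 14 for fluency) is consistent with the text describing adequacy ratings as more consistent than fluency ratings.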

Conclusion
In this study, we have conducted a fine-grained manual evaluation to identify and present the dominant types of translation errors produced by Google Translate. The final results suggest that the existing errors in the MT output are mainly related to mistranslations, corruption of the overall meaning of a sentence, and orthographic errors. Moreover, according to the results of our evaluation, Google Translate produces sentences with relatively few errors in English-to-Arabic translation, and the translated text is fluent to some extent. These results can help other researchers in the field to examine these three types of errors more closely and, thus, explain the reasons behind the failure in translation at these three levels. From an information technology perspective, there seems to be a need to develop more intelligent translation software that considers the context of texts in the translation process. Further research is also needed to complement the findings of the current study; the use of MT in translating specialized texts might reveal different weaknesses. Finally, we believe our empirical findings represent a significant contribution to the field of evaluating and improving Google Translate if the current results of the error analysis for the English-Arabic language pair are taken into consideration.