A Methodological Review of Machine Learning in Applied Linguistics

Traditional linear regression in applied linguistics (AL) is constrained by strict assumptions such as linearity and normality. More advanced methods are needed to overcome these shortcomings and to tackle intricate linguistic problems. However, no previous review has covered the applications of machine learning (ML) in AL, introduced interpretable ML, or surveyed the related practical software. This paper addresses these gaps by reviewing representative ML algorithms as applied in AL. The review shows that ML is applicable in AL and has a promising future. The paper then discusses the use of interpretable ML for reporting results in AL. Finally, it recommends practical programming languages, software, and platforms with which AL researchers can implement ML, in order to foster interdisciplinary studies between AL and ML.


Introduction
The past few years have witnessed an increasing awareness of the importance of statistical methods in linguistics (Khany & Tazik, 2019; Nikitina & Furuoka, 2018; Norris et al., 2015), probably because statistical approaches play a vital role in investigating linguistic variables. One of the most commonly used methods is linear regression. However, this method is handicapped by strict assumptions, including normality, linearity, and homoscedasticity (Plonsky & Ghanbar, 2018). This has given rise to the application of more advanced algorithms to tackle more complicated linguistic problems. These newer techniques are, to a large extent, represented by machine learning (ML). However, there is no systematic review of the applications of ML in applied linguistics (AL). Most reviews of the methods employed in AL deal with traditional methods such as linear regression. Moreover, how to report and interpret the results of an ML model remains unclear to most researchers in applied linguistics, so an introduction to practical ways of using ML is needed. This paper therefore attempts to fill these gaps by summarizing the applications of ML in AL, introducing interpretable ML, and suggesting approaches with which AL researchers can implement ML.

Definitions of Keywords
Machine learning refers to the process by which computers automatically discover the underlying patterns in data, rather than following rules designed by hand in advance. ML can be classified into two broad categories: supervised learning and unsupervised learning. The definition of AL adopted in this paper covers the study of language and of language-related problems that arise when people use or learn languages, as defined by Lei and Liu [...] the concern over the application of this method in psycholinguistics. Nicklin and Plonsky (2020) delineated the problem of outliers in L2 research and summarized the methods currently used to deal with them. As for factor analysis, Plonsky and Gonulal (2015) reviewed exploratory factor analysis in linguistic research. Lindstromberg (2016) reviewed inferential statistics in English language teaching research and emphasized the unsuitability of p values. Norris (2015) reviewed statistical significance testing in L2 research, identified its problems, and argued for directions of reform. King and Mackey (2016) summarized the developmental trend of the research methodology employed in L2 research. Paquot and Plonsky (2017) reviewed the quantitative methods used in corpus linguistics, including ANOVA, factor analysis, and resampling. Other researchers, for example Norouzian (2020), have examined the problem of sample size planning in second language research. Nikitina and Furuoka (2018) discussed the application of quantile regression with bootstrapping to deal with non-normal data and outliers. These reviews and studies have contributed greatly to our understanding of methodology in applied linguistics. However, they focus exclusively on traditional methods. The problem of strict assumptions has not been solved, and the demand for more advanced methods to tackle more complicated problems in applied linguistics has not been met. This leaves room to investigate machine learning in applied linguistics.

Brief Introductions to ML
What follows is a brief introduction to typical ML algorithms: logistic regression, k-nearest neighbors, Bayesian models, support vector machines, random forests, XGBoost, clustering, and neural networks. Logistic regression is in fact a generalized linear model. K-nearest neighbors is often regarded as one of the simplest ML algorithms, since its main idea is to compute the geometric distance between a new observation and the training observations and to predict from the closest ones. Bayesian models are founded on probability theory. The support vector machine was one of the representative ML algorithms before 2005. Its core idea is to project the original data into a higher-dimensional space in which the classes can be separated more easily. Random forests and XGBoost are two major ensemble learning algorithms: the former is based on bagging and the latter on boosting. Neural networks are loosely inspired by the human brain; in their multi-layer form they are now commonly known as deep learning, which represents the state of the art in AI. They come in several variants, including recurrent neural networks, long short-term memory networks, and convolutional neural networks (Barredo Arrieta et al., 2020).
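The distance-based idea behind k-nearest neighbors can be illustrated with a minimal sketch in plain Python. The function name `knn_predict`, the two features, and the proficiency labels are all invented for illustration and do not come from any of the reviewed studies.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (mean sentence length, type-token ratio) per essay,
# labelled with an invented proficiency band.
train_points = [(8.0, 0.40), (9.0, 0.45), (10.0, 0.42),
                (18.0, 0.70), (19.0, 0.72), (20.0, 0.68)]
train_labels = ["beginner", "beginner", "beginner",
                "advanced", "advanced", "advanced"]

prediction = knn_predict(train_points, train_labels, query=(17.5, 0.66), k=3)
print(prediction)
```

In this toy example the two features are on very different scales, so the first one dominates the distance; with real data the predictors would normally be standardized before distances are computed.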

Motivations to Review the Applications of ML in AL
The rationale for reviewing the applications of ML in AL can be summarized as follows. The first motivation concerns the drawbacks of linear regression. Traditional linear regression has several limitations, relating to the normality assumption, the linearity assumption, collinearity, and robustness to outliers.
First of all, linear regression rests on the normality assumption. This assumption is usually justified by appeal to the central limit theorem, but it becomes problematic when the data deviate from the normal distribution. Scholars have long argued for more advanced methods for linguistic studies that do not rely on the normal distribution (Nicklin & Plonsky, 2020). According to a review by Hu and Plonsky (2019), a large number of L2 studies did not check assumptions rigorously. This means that the findings of L2 studies conducted without a normality check may be problematic to some extent.
In addition, linear regression rests on the linearity assumption, which may be at odds with the nonlinear developmental patterns described by Complex Dynamic Systems Theory in AL (Lesonen et al., 2020); it cannot handle nonlinear data without modification. More importantly, linear regression cannot reliably estimate the importance of individual variables when predictors are collinear, so some variables have to be dropped in this case (Tomaschek et al., 2018). This poses a problem for research in applied linguistics (Wurm & Fisicaro, 2014). Besides, traditional methods such as linear regression are sensitive to outliers, which may distort the final results. Moreover, linear regression in its basic form assumes that the independent variables contribute additively, without interaction effects, unless interaction terms are specified explicitly. The final disadvantage of linear regression is the side effect of significance testing: we cannot decide whether one variable is related to another simply on the basis of a significance test. Table 1 summarizes the comparison between traditional linear regression and ML-based methods.
Another motivation for this synthesis is that ML, particularly the neural network, has been criticized for its opacity. Applications of ML in AL, for example automatic scoring, have been criticized because ML-based systems cannot be interpreted. A review of how to explain the results of ML is therefore necessary.
Aside from these two reasons, the third motivation for this review is the learning curve of ML. Due to the complexity of ML, linguistic researchers may have difficulty implementing it to solve problems in AL. For this reason, this paper also covers the most user-friendly programming languages, software, and platforms for researchers in AL.
With these three motivations taken into account, this paper sets out to answer the following questions:
Question 1: What contributions has machine learning made to different branches of AL?
Question 2: How can the results of ML models in AL be interpreted?
Question 3: What are suitable approaches for researchers in AL to make use of ML?

Inclusion Criteria
The inclusion criteria are as follows. First, the study should adopt ML to solve the problem at hand. Second, the problem should be related to applied linguistics. Third, studies that shed light on how to apply ML to AL are also included.

Literature Search
Google Scholar and Web of Science were used to collect the data. First, the author searched Web of Science for keywords related to ML within the list of linguistics journals ranked by impact. The ranking was produced by Web of Science, and further information about it can be checked online. The keywords were: k-nearest neighbor, naïve Bayes and Bayesian networks, support vector machine, random forests, XGBoost, neural networks, clustering, machine learning, data mining, and artificial intelligence. Google Scholar was then used to capture relevant literature that the first search had missed. Exemplary studies from adjacent disciplines were also added, although they account for only a very small proportion of the sample. They were included because they show how cross-disciplinary studies between AL and ML can be carried out. Most of the papers come from Social Science Citation Index (SSCI) journals.

Exclusion Criteria
As this review focuses on the applications of ML in AL, each paper must be related to both ML and AL. A paper that concentrates on only one of the two was excluded.

Data Coding and Grouping
After several rounds of searching and screening, the author read all the remaining papers one by one. As AL has many branches, the results were grouped by similarity, with similar branches treated as a single item. Some branches of AL yielded no applications in the search results and are therefore not reviewed. Table 2 shows the selected papers that meet the aforementioned criteria. Most of them are discussed in detail; a few are omitted for brevity. [...] (2016) implemented CDA with a neural network. From what has been discussed above, it can be seen that ML-based automatic essay scoring and automatic speech scoring are likely to become a new trend, and that students' competence might be diagnosed more and more precisely as ML advances.

Second Language Research
In second language (L2) research, probably the most frequently mentioned ML approach is Bayesian analysis. Gudmestad et al. (2013) described how Bayesian analysis can be applied in second language acquisition. A similar study was carried out by Norouzian et al. (2018), who presented Bayesian networks as a powerful tool in second language research and argued for a Bayesian revolution. Apart from Bayesian networks, clustering has proven very useful in second language research. Papi and Teimouri (2014) [...] What can be learned from the above is that Bayesian analysis has been taken up by linguists in second language research. Clustering can help us group students into different categories, on the basis of which tailored language teaching and learning might become possible.
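The idea of grouping learners by clustering can be sketched with a plain-Python version of k-means. This is not the procedure of any study cited above: the learner scores, the two variables, and the fixed starting centroids are invented for illustration, and a real analysis would normally use several random initializations and a principled way of choosing the number of clusters.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm): returns (centroids, cluster index per point)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical learner profiles: (motivation score, anxiety score).
learners = [(1.0, 5.0), (1.5, 4.5), (2.0, 5.5),
            (5.0, 1.0), (5.5, 1.5), (4.5, 2.0)]

centroids, groups = k_means(learners, initial_centroids=[learners[0], learners[3]])
print(groups)
print(centroids)
```

Each learner receives a group index, and the centroids summarize the typical profile of each group, which is the information a teacher or researcher would inspect when tailoring instruction to learner types.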

Corpus Linguistics
Within corpus linguistics, perhaps the most frequently adopted algorithm is the random forest. A methodological review of random forests in corpus linguistics was provided by Gries (2020). Fonteyn and Nini (2020) employed the random forest algorithm and conditional inference trees to analyze gerunds. Deshors (2020a) applied random forests to investigate the contextualized past tense and the interactions between variables, arguing that this method sidesteps the normality assumption discussed earlier in this paper. In another study from the same year, Deshors used random forests to investigate multi-speaker interactions. Frey (2020) provided a very comprehensive overview of ML algorithms in corpus linguistics. Apart from random forests, clustering was introduced into corpus linguistics by Hilpert (2016). Moreover, Sung et al. (2015) applied support vector machines to classify the readability of second language reading texts with reference to the Common European Framework of Reference (CEFR). H. Kang and Yang (2020) quantified political bias using machine learning. Ballier et al. (2020) reviewed a Kaggle competition that employed machine learning algorithms and natural language processing techniques to score essays automatically. It seems that the random forest is the most frequently used algorithm in corpus linguistics, and that other ML algorithms also show great potential.

Clinical Linguistics
In the field of clinical linguistics, the contribution of ML lies mainly in the diagnosis of language disorders. Geetha et al. (2000) employed artificial neural networks to classify childhood disfluencies with 92% accuracy. Logistic regression was applied by Reed and Wu (2013) [...]

Psycholinguistics and Neurolinguistics
In psycholinguistics and neurolinguistics, one of the most important contributions of ML is the relevance of the neural network framework to models of language processing. Concretely, neural network architectures can shed light on models of language learning, especially in bilingual studies. Zhao and Li (2010), for example, discussed bilingual lexical interactions from the perspective of neural networks. Monner et al. (2013) explained language learning phenomena from the same perspective. Frank (2020) discussed how recurrent neural networks can inform models of multilingual sentence processing. It can be seen that ML algorithms, especially neural networks, have gradually received attention from researchers specializing in psycholinguistics and neurolinguistics. One plausible reason is that the neural network is the point at which the interests of psycholinguists, neurolinguists, and computer scientists converge, and researchers on both sides seek to draw on each other's strengths. Furthermore, magnetic resonance imaging (MRI) is now widely used in neurolinguistics. Research on the diagnosis of language disorders and Alzheimer's disease by MRI may well be enhanced with the help of ML; one typical example is the study by Basaia et al. (2019). Heikel et al. (2018) traced the evolution of the neurocognitive process with a machine learning method called multivariate pattern analysis. Pearl and Enverga (2014) used machine learning to develop a mind-print-based system for identifying mental states. Munsell et al. (2019) applied machine learning algorithms to predict naming performance in temporal lobe epilepsy. Fromont et al. (2020) applied random forests to model individual data and found that language exposure and proficiency were the most important predictive variables. All in all, neural networks and other ML algorithms appear to have a bright future in psycholinguistics and neurolinguistics.

Phonetics
In phonetics, the major contribution of ML lies in ranking the importance of acoustic features and in automatic speech recognition systems. Al-Tamimi and Khattab (2018) employed both random forests and linear mixed models to find the most predictive indicators for distinguishing different stops. Przybyla and Teisseyre (2014) analyzed utterances with several ML algorithms in order to train a regressor that predicts the speaker's background. The results showed that random forests and the k-nearest neighbor algorithm outperformed the other algorithms. Arnhold and Kyrolainen (2017) investigated focus marking with both random forests and generalized additive mixed models, with particular attention to variable importance. Such work can help us understand phonetics and develop speech scoring systems that can be applied in language testing, clinical linguistics, and related areas. Support vector machines and linear discriminant analysis were employed by Howell et al. (2017) to train a classifier on different kinds of speech. Bybee and De Souza (2019) analyzed vowel duration in two different constructions using random forest analysis based on conditional inference trees. It can be seen that ML can be applied to phonetics as long as the original acoustic data can be digitized, after which researchers in phonetics can build an ML model to solve related problems.

Results of Question 2
After a classifier or a regressor has been trained, performance measures such as accuracy or a confusion matrix can be obtained. But sometimes we are far more interested in the importance of the input variables. This is a question of the interpretability of ML. As noted earlier, applications of ML in AL, for example automatic scoring, have been criticized on the grounds that the system cannot be interpreted. The problem of interpretability is therefore discussed in this section.
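For readers unfamiliar with these measures, the following minimal sketch computes accuracy and a confusion matrix by hand in plain Python. The pass/fail labels are invented for illustration; the helper functions are hypothetical and simply mirror what ML libraries report.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Proportion of cases in which the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical pass/fail decisions by human raters (true) and a scoring model (pred).
y_true = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
y_pred = ["pass", "fail", "fail", "fail", "pass", "pass", "pass", "fail"]

matrix = confusion_matrix(y_true, y_pred, labels=["pass", "fail"])
acc = accuracy(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 3]]
print(acc)     # 0.75
```

The off-diagonal cells show the two kinds of error separately (a true "pass" scored as "fail" and vice versa), which a single accuracy figure conceals.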
To begin with, the random forest is perhaps the algorithm most widely adopted by researchers in AL. The underlying reason may be that it reports the importance of each feature. More importantly, it is suitable whether or not the data follow the normal distribution. It remains applicable when predictors are collinear, and it works for data with a large number of predictors and a limited sample (Matsuki et al., 2016). Random forests also take interaction effects into account (Baumann & Winter, 2018). Several linguistic studies have used the algorithm in this way. Her and Tang (2020), for example, ranked feature importance with a random forest to understand the predictive power of the input variables. Similar cases include Deshors (2020b) and Wiechmann and Kerz (2014). As for decision trees, their results can be visualized, which helps us understand how the system works. For example, Fromont et al. (2020) illustrated the effect of individual variability by visualizing decision trees.
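A feature importance ranking of this kind can be obtained in a few lines with scikit-learn. The sketch below is only an illustration on simulated data: the predictor names are hypothetical, the outcome is constructed so that only the first predictor matters, and the impurity-based importances shown here are one of several importance measures rather than the specific measure used in the studies cited above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Simulated predictors: only the first one drives the (noisy) binary outcome.
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)
feature_names = ["exposure", "noise_a", "noise_b"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across the predictors.
importances = dict(zip(feature_names, forest.feature_importances_))
ranking = sorted(importances, key=importances.get, reverse=True)
for name in ranking:
    print(f"{name}: {importances[name]:.3f}")
```

Because the outcome was generated from the first predictor alone, that predictor should receive by far the largest importance, while the two noise variables share the small remainder.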
However, neural networks cannot be explained as easily as decision trees or random forests. It is for this reason that some ML-based automatic scoring systems in applied linguistics have been criticized. There is, however, an alternative method based on SHAP values (Ribeiro et al., 2016) that can explain neural networks. In fact, SHAP can in principle explain any classifier or regressor. Frey (2020) introduced this method in his doctoral dissertation on corpus linguistics. With this method, automatic scoring systems based on deep learning can be validated by linguistic practitioners, and the explanations can in turn offer valuable guidance for teaching and research. The method has rarely been adopted by linguists so far, but it has already been applied in other scientific disciplines. Further information on explainable ML can be found in Barredo Arrieta et al. (2020). Partial dependence plots and individual conditional expectation plots are also very useful ways to peek inside the black box; further information can be found in Adadi and Berrada (2018). Future studies in AL should make full use of interpretable ML.
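The logic behind partial dependence and individual conditional expectation (ICE) can be shown without any ML library, because both treat the model purely as a prediction function. In the sketch below, `black_box` is a hypothetical stand-in for a trained model such as a neural network, and the data rows are invented; in practice one would pass the trained model's prediction function instead.

```python
from statistics import fmean


def ice_curves(predict, rows, feature, grid):
    """One curve per row: predictions as `feature` is swept over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value  # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(predict, rows, feature, grid):
    """Average the ICE curves point by point to obtain the partial dependence."""
    curves = ice_curves(predict, rows, feature, grid)
    return [fmean(column) for column in zip(*curves)]


# Hypothetical stand-in for a trained black-box model: the score depends
# nonlinearly on feature 0 and linearly on feature 1.
def black_box(row):
    return row[0] ** 2 + 2.0 * row[1]


rows = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
grid = [0.0, 1.0, 2.0]

ice = ice_curves(black_box, rows, feature=0, grid=grid)
pdp = partial_dependence(black_box, rows, feature=0, grid=grid)
print(ice)  # [[0.0, 1.0, 4.0], [2.0, 3.0, 6.0], [4.0, 5.0, 8.0]]
print(pdp)  # [2.0, 3.0, 6.0]
```

Plotting the averaged curve against the grid would reveal the curved effect of the first feature even though the model itself is never opened up, which is what makes these plots usable with neural networks.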

Results of Question 3
Having reviewed the applications and interpretability of ML, the next question is how to make these techniques accessible to all linguistic researchers. This question is answered from three angles: programming languages, software, and platforms for ML.
To begin with, the most recommended programming language for ML is Python, followed by R. Python consistently ranks among the most popular programming languages in computer science, and it is gaining further popularity thanks to its simplicity and convenience. It is free and open source. R is also very popular in both computer science and AL (Mizumoto & Plonsky, 2016). As for libraries that implement ML algorithms, scikit-learn (sklearn) is a strong candidate, as it offers a wide range of functions covering many kinds of algorithms (Hao & Ho, 2019), and Keras, built on TensorFlow, is a flexible and powerful tool for building neural networks (Pang et al., 2019). PyTorch has also attracted more and more users in recent years, especially in academia. The recommended platform for R is RStudio. The recommended platforms for Python are Spyder and Jupyter Notebook, both of which are included in the Anaconda distribution.
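To give a sense of how little code a basic analysis requires, the sketch below runs a complete train-and-evaluate cycle in scikit-learn. The data are synthetic (generated with `make_classification` as a stand-in for a coded linguistic data set), and the choice of a standardized logistic regression is only an example, not a recommendation for any particular research question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a coded data set: 200 cases, 5 numeric predictors.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out a quarter of the cases so the model is evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Standardise the predictors, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Swapping `LogisticRegression()` for another estimator, for instance a random forest, leaves the rest of the workflow unchanged, which is one reason the library is convenient for researchers who are new to ML.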

Conclusions
The contributions that ML has made to AL can be summarized under the following headings: computer-assisted language learning and teaching, identification, diagnosis, automatic scoring, and feature importance ranking. Language learning and teaching will become more adaptive and personalized with the help of ML. ML can also help identify test fraud automatically and provide fine-grained information about a student's ability. Automatic scoring based on ML makes it possible to increase the reliability and validity of scoring in language testing. Diagnostic systems based on ML play an important role in clinical linguistics because they make it possible to diagnose patients with language disorders within a short period of time. Finally, ML algorithms are more widely applicable than traditional linear regression. Random forests, for example, can deal with complex data without relying on assumptions of normality or linearity, and they remain usable when predictors are collinear. The interactions between variables can also be investigated. Further studies in AL should make use of interpretable ML to gain more information from data. In terms of problem type, the problems that ML can solve include regression, classification, clustering, and dimensionality reduction. Researchers in AL should pay more attention to these four kinds of problems, in which interdisciplinary studies between ML and AL can be carried out. It seems that ML is applicable in most branches of AL. The data in AL include numerical data, natural language, images, and any other material that can be digitized. ML may also help move the solution of AL problems toward automation. Most importantly, ML may continue to help researchers in AL investigate and deal with the more perplexing linguistic problems that traditional methods cannot solve (Gass et al., 2020). All in all, the application of ML in AL offers advantages over traditional methods in terms of accuracy and flexibility.

Limitations and Directions for Further Studies
Despite the efforts made to avoid possible flaws, this review still has the following weaknesses. First, there is no clear-cut boundary between the different sub-fields of AL, so a paper assigned to one branch could also be classified under another. Future studies can overcome this defect by setting up clear and fine-grained classification standards. Second, this review covers a limited range of linguistics journals. Some relevant conference papers and doctoral dissertations may have been missed, although some typical studies from outside AL are included. Finally, this review concentrates on a limited number of branches of AL, but this does not mean that ML is inapplicable to other branches. Brown et al. (2014), for example, applied random forests to problems in pragmatics. Future studies can survey all potentially relevant papers to give a more panoramic picture.