Vocabulary Use in Doctoral Theses: A Corpus-Based Lexical Analysis of Academic Word List (AWL) in Major Scientific Disciplinary Groups

Since the development of the Academic Word List (AWL) by Coxhead (2000), multiple studies have investigated its effectiveness and the relevance of its academic vocabulary in texts and corpora from various academic fields, disciplines and subjects, as well as across academic genres and registers. Similarly, this study investigates the text coverage of Coxhead’s (2000) AWL in Pakistani doctoral theses from two major scientific disciplinary groups (Biological & Health Sciences and Physical Sciences); it also analyses the frequency of the AWL word families to identify the most frequent word families in the thesis texts. To this end, a pre-built corpus of Pakistani doctoral theses (PAKDTh) (Aziz, 2016), comprising 200 doctoral theses from the two disciplinary groups, was used as textual data. Computer-driven analysis using the concordance software AntConc version 3.4.4 (Anthony, 2016) revealed that AWL words cover 8.76% (496839 words) of the text in the PAKDTh corpus. Breaking the analysis down by sub-list shows that the first three sub-lists of the AWL account for almost 57% of this coverage. The AWL text coverage was then analysed further in terms of the frequency of occurrence of word families. The findings showed that, of the 570 word families in Coxhead’s (2000) AWL, 550 word families (96.49%) occur more than 10 times in the PAKDTh corpus and are therefore treated as word families used in the corpus. The study concludes that Coxhead’s (2000) AWL is effective for thesis writing. On the basis of the findings, possible academic implications are discussed.


Introduction
Pakistani students, as non-native speakers of English in an ESL context, are assumed at all academic levels to have very limited vocabulary knowledge, which may affect their proficiency in academic discourse (Mozaffari & Moini, 2014). Language teachers may have given little attention to teaching vocabulary, which could be the main reason for students' lack of vocabulary knowledge. Coady (1997) reports that such practices among ESL teachers stem from the traditional language teaching (which neglected vocabulary) that they themselves experienced as learners. Similarly, according to Macaro (2003), language teachers in ESL contexts often neglect vocabulary, and they must be provided with proper research-based practices for incorporating it into their teaching. Another important factor is learning resources and curricula (Fan, 2003; Warsi, 2004), which hinder language teachers from doing so. Despite all this, students learning English regard vocabulary as one of the most important areas to master. Leki & Carson's (1994) survey also provided evidence of students' serious attitude towards vocabulary learning. As students advance to higher academic levels, the growing number of subjects and textbooks exposes them to more vocabulary (Nagy & Anderson, 1984; Stahl, 1998; Schmitt, 2000; Biemiller, 2005; Stahl, 2005), and they find themselves surrounded by a vast variety of texts containing specific vocabulary. Thus, they mainly focus on learning the vocabulary that is specific to their courses and subjects. Hence, Nation suggests "… to direct vocabulary learning to more specialized areas when learners have mastered the 2000...3000 words of general usefulness in English" (2001, p. 187); but this is not always effective for learners, whether they are native or non-native speakers of English.
Students may master general or specialized vocabulary yet remain less acquainted with the vocabulary they need for better academic achievement and for understanding academic discourse at higher levels. Consequently, academic vocabulary, which occurs less frequently than general vocabulary (Worthington & Nation, 1996; Xue & Nation, 1984), appears difficult for learners (Cohen et al., 1988) because they are more familiar with technical or specialized vocabulary than with academic vocabulary. It is therefore crucial to take academic vocabulary development into account when teaching English to students at any academic level (primary, elementary, secondary, higher secondary or tertiary).
The many advances in technology and their role in linguistics cannot be overlooked. As a result, corpus linguistic research has recently become widespread, particularly in English for specific purposes (ESP) and English for academic purposes (EAP). ESP and EAP practitioners and researchers find it highly valuable in linguistic research, as it helps them develop better and more explicit knowledge about language. Corpus linguistics is generally defined as an approach and research method in linguistics, rather than a branch of linguistics, which empirically examines natural languages through corpus-based techniques using computers (McEnery & Wilson, 2001). Corpus linguistics is also widely used in vocabulary research. Coxhead (2000) used an academic corpus of 3.5 million words to construct the Academic Word List (AWL).
The current study aims to investigate the use of AWL words in Pakistani doctoral theses. It also attempts to examine the use of AWL words separately in two major disciplinary groups of Engineering & Technological, Biological, and Health Sciences.
The current paper attempts to answer the following questions:
1) What is the text coverage of AWL words in the corpus of Pakistani doctoral theses (PAKDTh)?
2) Which AWL word families are frequently used in Pakistani doctoral theses?

Review of Literature
The corpus compiled by Coxhead (2000) to create the AWL comprises texts from diverse academic sources (such as academic journal articles, academic web articles, textbooks, course books, scientific texts and laboratory manuals) in 28 subject areas from four major academic disciplines: arts, law, science, and commerce. Since the development of the AWL, various studies have sought to determine its effectiveness across academic fields and disciplines. The AWL contains 570 word families in total. A word family consists of a root word together with its different word forms, such as assume, assumed, assumes, assuming, assumption and assumptions.
The review of the literature indicates that relatively few studies have focused on the use of the AWL in particular disciplines and fields. For example, Chung & Nation (2003) compared the use of Coxhead's (2000) AWL and West's (1953) General Service List (GSL) in applied linguistics and anatomy books; Mudraya (2006) analysed AWL use in the Student Engineering English corpus of 2 million words; Chen & Ge (2007) carried out a lexical analysis of a corpus of medical research papers; Vongpumivitch et al. (2009) analysed the frequency of Coxhead's (2000) AWL word families in a corpus of applied linguistics research articles; and Martinez (2009) critically investigated Coxhead's (2000) AWL word families in agriculture research articles.
The importance and effectiveness of Coxhead's (2000) AWL have been discussed by various researchers (e.g., Chung & Nation, 2003; Mudraya, 2006; Chen & Ge, 2007; Vongpumivitch et al., 2009). Some studies (Martinez, 2009; Mozaffari & Moini, 2014), however, do not consider the AWL an effective source in terms of text coverage in specific fields and disciplines. Mozaffari & Moini (2014), for example, explored the usefulness of the AWL word forms in education research papers and attempted to extract the non-AWL words that frequently appear in education research; to this end, a relatively large corpus of education research papers was compiled.
This study attempts to analyze the use of academic vocabulary, as represented by the AWL, in the doctoral theses of two distinct scientific disciplinary groups and to extract the most frequently used AWL word families in the theses. To achieve this aim, a pre-built corpus of Pakistani doctoral theses (PAKDTh) is used as textual data representing the written English of theses as an academic genre and a non-native variety of written academic English.

Data Analysis
Coxhead's (2000) AWL word families, which are categorized into 10 sub-lists on the basis of frequency, were retrieved from the internet. The sub-lists, including all the word families and their forms, were saved separately as Notepad (plain-text) files. The lists were then modified to create lemma list files for the sub-lists so that they could be used in AntConc 3.2.4 (a concordancing program) to generate lemma search result lists for frequency counting based on AWL word families (headwords) and their forms (lemmas).
The 570 AWL word families comprise 3111 lemmas (word forms). The sub-corpora of the PAKDTh corpus for the two major disciplinary groups, Biological & Health Sciences (BHSc) and Physical Sciences (PHSc), were loaded separately into the concordancing program (AntConc). The lemma search feature of AntConc 3.2.4 was employed to generate lemmatized frequency lists for both sub-corpora. The search results were transferred to separate Microsoft Excel (spreadsheet) files for the two groups, and frequency counts were calculated to analyze the use of the AWL in the sub-corpora as well as in the PAKDTh corpus as a whole. The results and findings are discussed in detail in the following sections.
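The counting procedure described above can be approximated outside AntConc. The following is a minimal Python sketch, not the procedure actually run in this study: the "headword->form,form" line format, the two sample families, the sample sentence and the function names are all illustrative assumptions, and the tokenizer is deliberately simple.

```python
import re
from collections import Counter

# Hypothetical lemma-list lines in an assumed "headword->form1,form2" layout.
# Only two partial families are shown; the real AWL has 570 families.
LEMMA_LINES = [
    "assume->assumed,assumes,assuming,assumption,assumptions",
    "analyse->analysed,analyses,analysing,analysis,analyze,analyzed",
]

def parse_lemma_lines(lines):
    """Map every word form (including the headword itself) to its headword."""
    form_to_head = {}
    for line in lines:
        head, _, forms = line.partition("->")
        head = head.strip().lower()
        form_to_head[head] = head
        for form in forms.split(","):
            form = form.strip().lower()
            if form:
                form_to_head[form] = head
    return form_to_head

def family_frequencies(text, form_to_head):
    """Return (occurrences per word family, total number of tokens) for a text."""
    # Lowercase alphabetic tokens; internal hyphens are kept (e.g. "so-called").
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(form_to_head[t] for t in tokens if t in form_to_head)
    return counts, len(tokens)

# Hypothetical sample text standing in for a thesis sub-corpus.
sample = "We assumed that the analysis was valid; the assumptions were analysed again."
form_to_head = parse_lemma_lines(LEMMA_LINES)
counts, total_tokens = family_frequencies(sample, form_to_head)
print(counts.most_common(), total_tokens)
```

Grouping forms under a headword before counting is what makes the resulting frequencies family-based rather than form-based, which is the unit used in the analysis that follows.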

Coverage of AWL in PAKDTh Corpus
The analysis of AWL words in the PAKDTh corpus reveals that AWL words cover 8.76% of the text in the Pakistani doctoral thesis corpus. As shown in Table 2, 496839 occurrences of AWL words were found in the whole PAKDTh corpus. In each of the disciplinary groups' sub-corpora (BHSc & PHSc), the AWL coverage is close to the overall text coverage, with 8.62% (251879 words) of BHSc texts and 8.91% (244960 words) of PHSc texts. These findings show the relative effectiveness of AWL words in both scientific disciplinary groups and in the written academic genre of doctoral theses. According to Coxhead's (2000) analysis, the text coverage of the AWL in the Science sub-corpus, which included texts from the subject areas of biology, chemistry, computer science, geography, geology, mathematics, and physics, was 9.1%. The use of AWL words in the PAKDTh corpus and its sub-corpora is therefore close to that reported by Coxhead (2000). It is worth noting that the counts for AWL coverage in this study include the occurrences of all the AWL word families and their forms, but the counts are not filtered by the range and frequency criteria that Coxhead (2000) employed for the development of the AWL. If such criteria were applied, the results for AWL coverage of the PAKDTh corpus might differ. In terms of token types (word forms), sub-lists 1-9 cover 96.73% of the 2443 AWL word forms found in the PAKDTh corpus; these 2443 forms make up 78.55% of the 3110 token types included in the AWL word families and 1.79% of the 135789 token types in the PAKDTh corpus.
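The coverage figures above follow from a simple ratio: AWL tokens divided by all running words, times 100. The short Python sketch below is only an arithmetic check on the reported numbers; the function names are illustrative, and the corpus size it derives is an implied value back-calculated from the rounded percentage, not a figure reported by the study.

```python
def coverage_percent(awl_tokens, total_tokens):
    """Text coverage: share of running words that belong to the AWL, in percent."""
    return 100.0 * awl_tokens / total_tokens

def implied_total_tokens(awl_tokens, reported_percent):
    """Back-calculate the corpus size implied by a reported coverage percentage."""
    return awl_tokens / (reported_percent / 100.0)

# AWL token counts reported for the two sub-corpora.
bhsc_awl = 251879
phsc_awl = 244960

# The sub-corpus counts add up to the total reported for the whole corpus.
total_awl = bhsc_awl + phsc_awl  # 496839

# Corpus size implied by the reported 8.76% (approximate, since 8.76 is rounded).
implied_size = implied_total_tokens(total_awl, 8.76)
print(total_awl, round(implied_size))
```

The implied size falls between 5.6 and 5.7 million words, which is in line with the approximate corpus size given in the corpus description.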

Frequency of AWL in PAKDTh Corpus
The second objective of this study was to analyze the frequency of AWL word families in Pakistani doctoral theses. The AWL text coverage was therefore analyzed further in terms of the frequency of occurrence of word families. To answer research question 2, the frequencies of the AWL word families in the entire PAKDTh corpus were calculated and arranged by frequency of occurrence in the corpus, as given in Table 4.
The analysis by frequency of occurrence shows that, of the 570 word families in Coxhead's (2000) AWL, 550 word families (96.49%) occur more than 10 times in the Pakistani doctoral theses corpus (PAKDTh) and are treated as word families used in the corpus. By contrast, only 19 word families (3.33%) occur fewer than 10 times (between 1 and 9 times), and only 1 word family has no occurrences; both of these groups can be regarded as word families not frequently used in the PAKDTh corpus. In this study, the family of "analyze" was the most frequently used AWL word family, with 10442 occurrences in the PAKDTh corpus. Other AWL word families such as significant, react, method, found, extract, concentrate, data, differ, conflict and positive were also observed with high frequency in the corpus. Most importantly, a considerable share of the AWL word families, 136 (23.86%), occur more than 1000 times in the PAKDTh corpus, which shows the importance of the academic vocabulary included in Coxhead's (2000) AWL in the texts of doctoral theses. The 100 most frequently used AWL word families found in Pakistani doctoral theses are listed in Appendix A.
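The banding of word families by frequency can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the band labels and function names are hypothetical, the lower band boundary is taken as 10 or more occurrences (the three reported bands together account for all 570 families), and every frequency in the toy dictionary except the reported 10442 for "analyze" is invented for demonstration.

```python
from collections import Counter

AWL_FAMILIES = 570  # number of word families in Coxhead's (2000) AWL

def band(frequency):
    """Assign a family to a frequency band (assumed boundaries: 0, 1-9, 10+)."""
    if frequency == 0:
        return "absent"
    if frequency < 10:
        return "1-9"
    return "10+"

def share(count, total=AWL_FAMILIES):
    """Percentage of the AWL word families represented by `count` families."""
    return round(100.0 * count / total, 2)

# Toy input: only the value for "analyze" comes from the study; the rest are
# hypothetical placeholders.
toy_frequencies = {"analyze": 10442, "family_a": 7, "family_b": 0, "family_c": 250}
toy_bands = Counter(band(f) for f in toy_frequencies.values())
print(dict(toy_bands))

# Shares recomputed from the family counts reported in the text.
reported_counts = {"10+": 550, "1-9": 19, "absent": 1, "over 1000": 136}
reported_shares = {name: share(n) for name, n in reported_counts.items()}
print(reported_shares)
```

Recomputing the shares from the reported counts reproduces the percentages given in the text (96.49%, 3.33% and 23.86%), with the single absent family amounting to about 0.18% of the list.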
It is important to note that there is a considerable difference between the results of this study and Coxhead's (2000) arrangement of AWL word families into sub-lists (1-10) on the basis of frequency of occurrence. Several AWL word families that are positioned as high-frequency words in Coxhead's (2000) sub-lists do not occur with comparable frequency in this study, and vice versa. For instance, words such as authority, contract, export, finance, labour, legal and legislate were included in sub-list 1 of Coxhead's (2000) AWL because they were found with high frequency in Coxhead's (2000) academic corpus, but their frequency in the PAKDTh corpus is very low, ranging from 9 to 86 occurrences. Conversely, certain word families that are less frequent in Coxhead's (2000) analysis and are included in sub-lists 8, 9 and 10 of the AWL, such as detect, exhibit, induce, intense, nuclear, radical, found, mature, medium, and so-called, are among the most frequent word families (Appendix A) in the PAKDTh corpus, with frequencies ranging from 1211 to 7335.
Only one AWL word family, that of the word compound, does not occur in the PAKDTh corpus. Taken together, it can be assumed that these differences might be due to the texts included in the PAKDTh corpus, which were taken from only two scientific disciplinary groups (BHSc and PHSc) and not from other disciplinary groups such as arts, humanities, and social sciences.

Conclusion
The current study was an attempt to analyze the frequency and coverage of academic vocabulary in scientific doctoral thesis texts using a corpus. For this purpose, a pre-built corpus of Pakistani doctoral theses (PAKDTh) (Aziz, 2016) was used, which comprises 200 Pakistani doctoral theses from two major scientific disciplinary groups (Biological & Health Sciences and Physical Sciences) covering 17 distinct disciplines and subject areas.
The study reveals that the text coverage of AWL word families in the scientific doctoral theses corpus was 8.76%, which indicates the effectiveness and importance of Coxhead's (2000) Academic Word List in the academic genre of theses and in the two scientific disciplinary groups (sub-corpora). Furthermore, the analysis revealed that the first three sub-lists of the AWL accounted for almost 57% of this coverage. It can thus be concluded that the word families in the first three sub-lists play a very important role in the coverage of the AWL in science doctoral theses, as represented by the PAKDTh corpus.
The results of this study also reveal that 550 word families (96.49%) of the 570 word families in Coxhead's (2000) AWL are frequently used in the doctoral theses of the scientific disciplinary groups.
On the basis of the findings of this study, all those concerned with learning and teaching academic and scientific writing (e.g., novice researchers, EAP learners and teachers, research writers, writing instructors, and designers of course books and materials) are advised to draw on the vocabulary included in Coxhead's AWL. The AWL can be considered one of the most reliable sources for the development, learning, and teaching of academic vocabulary, specifically at higher secondary and tertiary levels.

PAKDTh comprises 200 texts of Pakistani doctoral theses from 17 disciplines, categorized into two sub-corpora for the major disciplinary groups PHSc and BHSc. PAKDTh contains 200 theses, 100 from each group. The size of the PAKDTh corpus is approximately 5.6 million words. The exact number of words and the disciplines included in the corpus are shown in the table given below:

Table 1. PAKDTh corpus description. Adapted from "Linguistic Variation across Major Disciplinary Groups of Pakistani Academic Writing: Multidimensional Analysis of Doctoral Theses" by Aziz, Pathan, & Ali (2016), ARIEL-An International Research Journal of English Language and Literature, 27, 27-60.

Table 2. Coverage of AWL words in PAKDTh Corpus

The text coverage of AWL words in the PAKDTh corpus, distributed per sub-list (sub-lists 1-10), is provided in Table 3. The results show that the coverage of the AWL words in sub-lists 8, 9 and 10 is considerably lower than that of the words in sub-lists 1 to 7. The greater part of the AWL coverage comes from sub-lists 1, 2 and 3, with 28.38%, 15.74% and 12.72% respectively.
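The combined contribution of the first three sub-lists can be checked with a running total. The Python sketch below uses only the three shares reported above; the variable names are illustrative, and the shares of sub-lists 4 to 10 are not reproduced here, so it is a partial check and not a reconstruction of Table 3.

```python
from itertools import accumulate

# Reported shares of total AWL coverage for sub-lists 1, 2 and 3 (in percent).
sublist_shares = [28.38, 15.74, 12.72]

# Running total after each sub-list, rounded to two decimals for display.
cumulative_shares = [round(value, 2) for value in accumulate(sublist_shares)]
print(cumulative_shares)

# Share of AWL coverage left for sub-lists 4 to 10 combined.
remaining_share = round(100.0 - cumulative_shares[-1], 2)
print(remaining_share)
```

The running total after the third sub-list is 56.84%, which is the figure rounded to "almost 57%" in the discussion of the findings.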

Table 4. Frequency of occurrence for AWL word families in PAKDTh Corpus