The Design and the Construction of the Traditional Arabic Lexicons Corpus (The TAL-Corpus)

Arabic lexicography is a well-established and deep-rooted art of Arabic literature. Computational lexicography harnesses the computational and storage power of modern computers to accelerate long-term lexicographic projects. A collection of 23 machine-readable dictionaries, freely available on the web, was used to build the Corpus of Traditional Arabic Lexicons (the TAL-Corpus). The purpose of constructing the TAL-Corpus is to collect and organize the well-established and long traditions of traditional Arabic lexicons, which can also be used to create new corpus-based Arabic dictionaries. The compilation of the TAL-Corpus followed standard design and development criteria that informed major decisions during corpus creation. The corpus building process involved extracting information from disparate formats and merging the traditional Arabic lexicons. As a result, the TAL-Corpus contains more than 14 million words and over 2 million word types (distinct words). The TAL-Corpus was then used to create a useful morphological database. This database was constructed automatically using a new algorithm informed by Arabic linguistic theory. The algorithm processed the text of the TAL-Corpus and extracted 2 781 796 entries. These entries were stored in the morphological database, where each entry represents a word-root pair (i.e. an Arabic word and its root). A comparative evaluation of the TAL-Corpus against three other Arabic corpora showed that it scored higher in lexical diversity. Moreover, its coverage was computed by comparing its words and lemmas against their equivalents in the other corpora: it scored about 67% when comparing words and 82% when comparing lemmas.

In addition, many Arabic lexical databases have been constructed. The Buckwalter Arabic Morphological Analyzer (BAMA) (Buckwalter 2002) contains Arabic-English lexicon files. One of them contains 82 185 stems and has been reused in many Arabic NLP tools such as morphological analyzers and spell checkers. Similarly, AyaSpell, a spell checker for Arabic, depends on a lexicon which was built by analyzing 5 traditional Arabic lexicons. It contains more than 50 000 entries distributed over more than 10 000 verbs and more than 40 000 nouns, particles and residuals (Zarrouki & Kebdani 2009; Zarrouki & Balla 2009). A third example is the Arabic WordNet (AWN). It is a lexical resource for MSA based on the design and contents of the Princeton WordNet (PWN) for English. The semantic background of the AWN was encoded in a large ontology that contains around 1 000 terms and 4 000 definition statements (Elkateb & Black 2001; Black & El-Kateb 2004; Elkateb, Black et al. 2004; Rodríguez, Farwell et al. 2008). Likewise, Arabic VerbNet is a large lexicon of Arabic verbs. It contains verb entries where each entry is a third person masculine singular perfect verb. It has 173 classes which contain 4 392 verbs and 498 frames (Mousser 2010). Aralex is a lexical database which was developed to study the cognitive processing of Arabic in relation to precise frequency counts. Aralex was built from a 40-million word MSA corpus which was collected from online newspapers. It provides information about orthographic forms, stems, roots and patterns and their frequencies (Boudelaa & Wilson 2010). The Quranic Arabic WordNet (QAWN) is a WordNet for the Qur'an and consists of 6 918 synsets that were constructed from about 8 400 unique word senses, on average 5 senses for each word (Al-Maayah et al. 2015). These lexical databases were designed and built for specific purposes and for specific Arabic NLP applications. They are small in size and they are designed for MSA only (Sawalha 2011).
This paper describes an important lexical resource that was constructed to improve Arabic lexicography and Arabic NLP tools. The Corpus of Traditional Arabic Lexicons (the TAL-Corpus) is constructed from the text of 23 traditional Arabic lexicons spanning a period of over 1 200 years. The TAL-Corpus will be used as part of a large lexicographic corpus of Arabic to build new modern Arabic dictionaries. The TAL-Corpus can also be used to study the evolution of the Arabic vocabulary system. The TAL-Corpus is accessible via an online interface which allows users to search for lexical entries.

Traditional Arabic Lexicons and Lexicography
Arabic lexicography is a well-established and deep-rooted art of Arabic literature. Arabic lexicography was founded by al-farāhῑdῑ (died in 791), who constructed the first Arabic lexicon, kitāb al-'ayn ‫َالعين‬ ‫كتاب‬ 'al-'ayn lexicon'. Over the past 1 400 years, many Arabic lexicons were constructed. The lexical entries (i.e. roots) appear in Arabic dictionaries followed by a definition part which may span several pages. The definition part is written as a unit or an encyclopaedic article which defines all the words derived from a certain root. These lexical entries are not arranged or distinguished with special formatting. Figure [ ] shows a sample lexical entry where the derived words of the root (k-t-b) are underlined and highlighted in blue.
Four main classes of ordering lexical entries were developed and followed by the authors of Arabic lexicons. Three arrangement methodologies depend on the roots of the words as lexical entries. The fourth one groups lexical entries according to their conceptual themes or topical frames. These arrangement methodologies differ from those used in modern English dictionaries, where the lexical entries, which are words (i.e. lexical entries in the form of lemmas), are arranged alphabetically and followed by their definitions.
The following is a translated excerpt from the definition article of the root (k-t-b): [al-kitāb] the book is something which has been written on. And in Hadith: who looks at his brother's book without permission is as looking to hell. Ibn Al-Atheer said: it is a similarity; which means as he avoids hell, he should avoid doing this. He said: the meaning (of the Hadith) is that the punishment by hell will be applied if someone looks at a book without permission. He said: it might be the punishment of visual explorers, as the crime is done by sight. A hearing explorer is punished if someone intentionally listened to other people who do not like anyone to listen to them. He said: this Hadith is specific to books of secrets and secure books, whose owners hate anybody to look at these books. It is also said: the Hadith is general; applied to any type of books [kitāb].

The al-ẖalῑl Ordering Methodology
The first traditional Arabic lexicon is called ‫كتابَالعين‬ kitāb al-'ayn "al-'ayn lexicon". It was developed by ‫الخليلَ‬ ‫َالفراهيدي‬ ‫َأحمد‬ ‫بن‬ al-ẖalῑl bin aḥmad al-farāhῑdῑ (died in 791). The al-ẖalῑl ordering methodology, which was followed in constructing the al-'ayn lexicon, arranges the lexical entries phonologically according to the places of articulation of phonemes in the mouth and throat, working forwards from the glottal through to the labial regions. The al-'ayn lexicon was divided into books, where one book was dedicated to each letter. Each book was then divided into 4 sections according to the internal structure of the roots: (i) doubled biliteral roots; (ii) intact triliteral roots; (iii) doubly-defective roots; and (iv) quadriliteral and quinqueliteral roots. Many lexicons followed al-ẖalῑl's methodology with slight modifications.

The al-ğawharῑ Ordering Methodology
The second lexicon ordering methodology was developed by al-ğawharῑ in his lexicon ‫الصحاحَفيَاللغة‬ aṣ-ṣiḥāḥ fῑ al-luḡah 'The Correct Language'. Roots are the lexical entries of this lexicon. They were alphabetically ordered according to their last letter, then their first letter. This methodology is called the al-ğawharῑ methodology. The lexicon was organized into chapters where each chapter corresponds to the last letter of the root. Each chapter includes sections corresponding to the first letter of the root, then the second letter of triliteral roots, then the third letter of quadriliteral roots, then the fourth letter of quinqueliteral roots. For example, the word ‫بسط‬ basaṭ "spread", which is derived from the root (b-s-ṭ), is found in chapter ‫ط‬ ṭ, representing the last letter of the root, and in section ‫ب‬ b, representing the first letter of the root. Table [1] lists some of the traditional Arabic lexicons that followed this ordering methodology.

The al-barmakῑ Methodology
The third lexicon ordering methodology is "The al-barmakῑ methodology". It was developed by abū al-ma'ālῑ muḥammad bin tamῑm al-barmakῑ ‫َتميمَالبرمكي‬ ‫َبن‬ ‫َمحمد‬ ‫َالمعالي‬ ‫أبو‬ (died in 1006). In this methodology, lexical entries (i.e. roots) are alphabetically arranged according to the first letter of the root. al-barmakῑ lived in the same period as al-ğawharῑ. al-barmakῑ did not construct a new Arabic lexicon. Instead, he re-arranged the lexical entries of ‫الصحاحَفيَاللغة‬ aṣ-ṣiḥāḥ fῑ al-luḡah, which was developed by al-ğawharῑ. The al-barmakῑ methodology was followed by ‫الزمخشري‬ az-zamaẖšarῑ (died in 1143) in constructing his lexicon ‫َالبالغة‬ ‫أساس‬ asās al-balāḡah "Fundamentals of Rhetoric". Table [1] lists the Arabic lexicons which followed the al-barmakῑ methodology for ordering lexical entries. The al-barmakῑ methodology became the most widely used ordering methodology for Arabic lexicons.

The abū 'ubayd Methodology
abū 'ubayd al-qāsim bin sallām ‫َّم‬ ‫ُبيدَالقاسمَبنَسال‬ ‫أبوَع‬ (died in 838) developed the fourth ordering methodology for Arabic lexicons, which is called "The abū 'ubayd methodology". This methodology arranges and groups lexical entries together according to their semantic fields. This arrangement is similar to the arrangement of lexical entries in modern thesauri. Many lexicons followed this ordering methodology. ‫فَفيَاللغة‬ ّ ‫صن‬ ُ ‫الغريبَالـم‬ al-ḡarῑb al-muṣannaf fῑ al-luḡah "The Irregular Classified Language" by abū 'ubayd al-qāsim bin sallām was the first lexicon that followed this methodology. This lexicon includes many small books that each describe a single topic (i.e. group words of similar meanings), such as books describing horses, milk, honey, flies, insects, palms, and human creation. More than thirty of these small books were collated into one large lexicon. Figure [3] shows a sample from the Book of Colours taken from the al-ḡarῑb al-muṣannaf fῑ al-luḡah lexicon.

The Design of the TAL-Corpus
The motivation behind constructing the TAL-Corpus is to collect and organize the well-established and long traditions of traditional Arabic lexicons in one freely available resource. The TAL-Corpus will help Arabic lexicographers to design and construct new modern Arabic dictionaries. These dictionaries can have a new ordering methodology where derived words can be easily linked with their lexical entries, whether these are roots or lemmas. The TAL-Corpus can be used to determine the origin of Arabic vocabulary and to track the development and changes of word meanings. The TAL-Corpus can also be used to extract useful information that supports Arabic NLP applications such as root extraction applications, morphological analyzers, semantic networks of Arabic vocabulary, WordNets, and ontologies.
The following sections present the design criteria followed in constructing the TAL-Corpus. Atkins et al. (1992) proposed general criteria for corpus design. These principal aspects and standards are recommended to inform major decisions during corpus creation. They were designed to support high-quality and compatible corpora regardless of the corpus language, purpose, and location. Sections 3.1 to 3.5 discuss the design criteria followed to construct the TAL-Corpus.

Text
The text of the TAL-Corpus was collected from 23 freely available traditional Arabic lexicons, listed in Table 1. Al-Meshkat Islamic Network 1 ‫شبكةَمشكاةَاالسالمية‬ šabakat miškāt al-'islāmiyyah provides most of these lexicons freely. These lexicons have been key-boarded (i.e. typed) and put online in machine-readable formats as MS-Word (.doc) or HTML text files.
The texts of the collected Arabic dictionaries were organized using different ordering methodologies, as discussed in Section 2. However, most of these lexicons use roots as their main lexical entries. The definition of a root in each lexicon is written as an encyclopaedic article that contains the words derived from that root, their meanings, and examples of usage. These definitions vary in size from half a page to several pages. Figure [4] shows a sample of the text of a lexical entry taken from a traditional Arabic lexicon; the derived words are underlined and highlighted in blue. The text of the collected lexicons is fully vowelized, partially vowelized or non-vowelized. Texts (i.e. definitions) of the same roots from the different traditional Arabic dictionaries were grouped together in the TAL-Corpus. Then, several automatic processing steps and algorithms were applied to extract relevant linguistic information such as derived words and lemmas. Sections 3.4 and 3.5 discuss these processing steps and algorithms in detail.
After collecting the text of the 23 traditional Arabic dictionaries, common pre-processing steps were applied. First, all dictionary files were converted into standard text files using Unicode 'utf-8' encoding. Then, the SALMA-Tokenizer and the SALMA-root extractor and lemmatizer (Sawalha, 2011) were used to tokenize and process Arabic words by stripping diacritics and extracting the root and the lemma of each word in the TAL-Corpus. Third, frequency lists of both vowelized and non-vowelized words were generated (see Table [1] and Figure [5]).
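The diacritic-stripping and frequency-list steps can be sketched as follows. This is a minimal illustration, not the SALMA tools themselves, and the sample tokens are hypothetical; it relies only on the fact that Arabic short vowels are Unicode combining marks (category Mn).

```python
import unicodedata
from collections import Counter

def strip_diacritics(word):
    """Remove Arabic short vowels and other combining marks (Unicode category Mn)."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

def frequency_lists(tokens):
    """Build vowelized and non-vowelized frequency lists from a token stream."""
    vowelized = Counter(tokens)
    non_vowelized = Counter(strip_diacritics(t) for t in tokens)
    return vowelized, non_vowelized

# Hypothetical tokens: kataba, kataba, kutub
tokens = ["كَتَبَ", "كَتَبَ", "كُتُب"]
vowelized, non_vowelized = frequency_lists(tokens)
print(non_vowelized["كتب"])  # all three forms collapse to the bare skeleton -> 3
```

The two counters correspond to the vowelized and non-vowelized frequency lists mentioned above: distinct vowelized forms remain separate in the first, while forms differing only in diacritics merge in the second.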
A special algorithm was developed to extract the derived words of the lexical entries for the dictionaries included in the TAL-Corpus. The purpose of this algorithm is to group roots together with their definition parts and then to extract the derived words of the roots from their related definition articles. To achieve this goal, a specific treatment was applied to each dictionary text. The 23 collected dictionaries were originally constructed following an ordering methodology for their lexical entries, as discussed in Section 2. Most of them use roots as the head words of their lexical entries. These dictionaries were typed into machine-readable files in different formats without using any lexicographic representations that can be recognized by computers. Therefore, specialized programs were developed for each dictionary to reformat it and extract useful information such as roots, definitions and derived words.
The root-definition structure is the common basic structure of most traditional Arabic dictionaries. Each lexical entry consists of the root as a head word and the definition part. The definition part is written as an encyclopaedic article in a free writing style. These encyclopaedic articles define the root and its derived words and specify their linguistic attributes. However, the derived words of a root within the definition part are neither structured nor ordered. The free writing style also leads the authors of dictionaries to add affixes and clitics to the derived words within the definition parts. Clitics, such as conjunctions, prepositions and connected pronouns, are used to connect the sentences and paragraphs of these definition articles.
For the above-mentioned reasons, the free writing style of the definition part adds extra challenges to extracting the derived words and their definitions. Therefore, a dedicated algorithm was developed to extract the roots and their derived words from the dictionaries' texts. The tokenizing module of the program identifies the boundaries of a lexical entry, which normally starts with a root followed by an article that defines that root. For each lexical entry, the algorithm extracts words from the definition part, pairs them with the root, and stores them in vectors (i.e. bags of words). Many of these word-root pairs are not correct matches (i.e. the word is not derived from the associated root). A normalization analysis verified the word-root pairs by discarding pairs where the word is not derived from its associated root. The normalization procedure applies linguistic knowledge that governs the derivation of words from their roots. These linguistic rules match the consonant letters of words and roots, and their order, for each word-root pair. The first linguistic rule checks whether all the consonant letters forming the root appear in the paired word. The second rule checks whether the root letters appear in the derived word in the same order. Both rules must hold for a word-root pair to be verified. This process is applied to extract the derived words of a root and later to build a morphological lexicon (see Section 3.3.1). Figure [7] shows the process of selecting word-root pairs. Table [5] shows the number and percentage of words extracted from the original text of the dictionaries.
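The two verification rules can be sketched as follows. This is our reconstruction for illustration, not the authors' code; real Arabic derivation also involves weak-letter changes that this simple surface check deliberately ignores.

```python
def rule1_all_letters_present(word, root):
    """Rule 1: every consonant letter of the root occurs in the word."""
    return all(letter in word for letter in root)

def rule2_letters_in_order(word, root):
    """Rule 2: the root letters occur in the word in the same order."""
    pos = 0
    for letter in root:
        pos = word.find(letter, pos)
        if pos == -1:
            return False
        pos += 1
    return True

def verify_pair(word, root):
    """Keep a word-root pair only if both rules hold."""
    return rule1_all_letters_present(word, root) and rule2_letters_in_order(word, root)

print(verify_pair("مكتوب", "كتب"))  # maktūb 'written' matches k-t-b -> True
print(verify_pair("قلم", "كتب"))    # qalam 'pen' does not -> False
```

Applied to every word-root pair in the bag of words, these checks discard pairs whose word cannot plausibly derive from the associated root, as described above.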

Advanced Text Handling
The TAL-Corpus implements advanced text handling tools which can automatically process linguistic information in the corpus and allow more sophisticated statistical analyses. A lexical database (the SALMA-ABCLexicon) was created using the information extracted from the TAL-Corpus text.

Link to Lexical Database
The TAL-Corpus was used to construct the SALMA-ABCLexicon, a lexical database that contains around three million word-root pairs. This lexical database was extracted from the text of the TAL-Corpus following the analysis steps described in Section 3.2: (i) manually converting the traditional Arabic dictionaries' text into a unified format; (ii) applying a specialized algorithm that extracts a bag of words from the definition part of each entry and stores them as word-root pairs; and (iii) applying two linguistic rules to the word-root pairs to verify that the words are derived from the associated roots.
Later, a specialized program combines the disparate lexicon information into one large broad-coverage lexical resource, the SALMA-ABCLexicon. The lexical information of a large dictionary called ‫لسانَالعرب‬ lisān al-'arab 'Arab Tongue' was fed to the program as a seed for the SALMA-ABCLexicon. All word-root pairs of this first dictionary were included in the SALMA-ABCLexicon, representing around 48% of the total records. Around 82% of the words and roots of the ‫المحيطَفيَاللغة‬ mu'ğam al-muḥῑṭ fῑ al-luḡah dictionary were then added, representing around 14% of the total records. The َ ‫َالقاموس‬ ‫َجواهر‬ ‫َمن‬ ‫َالعروس‬ ‫تاج‬ tağ al-'arūs min ğawāhir al-qāmūs dictionary contributed 74% of its records, representing around 22% of the total records. The percentage of added records decreases during the combination process. This decrease indicates when the combination process can terminate and which traditional Arabic dictionaries are better suited to constructing the morphological lexicon. Figure [8] shows the first 60 derived words of the root ‫كتب‬ k-t-b 'wrote' stored in the SALMA-ABCLexicon.
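The combination step can be sketched as follows. This is our reconstruction of the merging logic described above, not the authors' program; the miniature lexicons are hypothetical. Each lexicon, taken in merge order, contributes only the word-root pairs not already present, which is why the percentage of added records shrinks as the process converges.

```python
def combine_lexicons(lexicons):
    """lexicons: list of (name, iterable of (word, root) pairs) in merge order.
    Returns the combined set of pairs and the number of new records
    each lexicon contributed."""
    combined = set()
    added = []
    for name, pairs in lexicons:
        new_pairs = set(pairs) - combined   # keep only pairs not seen before
        combined |= new_pairs
        added.append((name, len(new_pairs)))
    return combined, added

# Hypothetical miniature example: the second lexicon shares one pair with
# the seed, so it contributes only one new record.
lexicons = [("seed lexicon", [("كتاب", "كتب"), ("كاتب", "كتب")]),
            ("second lexicon", [("كاتب", "كتب"), ("مكتبة", "كتب")])]
combined, added = combine_lexicons(lexicons)
print(added)  # [('seed lexicon', 2), ('second lexicon', 1)]
```

Tracking the per-lexicon counts in `added` mirrors the observation in the text that the share of new records indicates when the combination can terminate.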

The TAL-Corpus Markup
Markup is introduced into the TAL-Corpus to indicate its features, such as lexicon name, lexical entry, and the definitions of lexical entries. The TAL-Corpus is formatted using XML, where the lexicons are reformatted and their lexical entries are alphabetically arranged. All traditional Arabic lexicons that form the TAL-Corpus are stored as XML files. XML is a markup language that facilitates the labelling or tagging of corpus features. Figure [9] shows the XML structure and the labels used to format the corpus files:

<Lexicon id="1" ar_name="تاج العروس من جواهر القاموس" eng_name="tağ al-'arūs min ğawāhir al-qāmūs" author_ar="الزبيدي" author_eng="az-zubaydῑ">
  …
  <lexicon_entry id="8391">
    <root>كتب</root>
    <text>كتب : ( …</text>
  </lexicon_entry>
  …
</Lexicon>

These corpus markups were used effectively when a web interface 2 for searching the contents of the corpus was developed. The web interface allows users to access the contents of the corpus, to search for a root and to retrieve the definition parts from the traditional Arabic lexicons included in the TAL-Corpus. Figure [10] shows part of the web interface displaying part of the results after searching for the root ‫"كتب"‬ k-t-b.
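A minimal sketch of how such XML files could be queried, assuming the element and attribute names shown in the figure (Lexicon, lexicon_entry, root, text); the definition text in the sample is a placeholder, not corpus content:

```python
import xml.etree.ElementTree as ET

# A small well-formed fragment following the markup scheme above.
sample = """<Lexicon id="1" eng_name="tağ al-'arūs min ğawāhir al-qāmūs">
  <lexicon_entry id="8391">
    <root>كتب</root>
    <text>definition article for the root k-t-b</text>
  </lexicon_entry>
</Lexicon>"""

def lookup(xml_text, root):
    """Return the definition texts of all entries whose head word is `root`."""
    lexicon = ET.fromstring(xml_text)
    return [entry.findtext("text")
            for entry in lexicon.iter("lexicon_entry")
            if entry.findtext("root") == root]

print(lookup(sample, "كتب"))  # ['definition article for the root k-t-b']
```

A root-based search of this kind is essentially what the web interface performs when retrieving definition parts for a queried root.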

Evaluation
The purpose of constructing the TAL-Corpus is to introduce a new lexicographic corpus that contains the majority of standard Arabic vocabulary. This kind of corpus will not only help in the design and development of Arabic monolingual dictionaries but can also support the construction of computational linguistics resources such as morphological dictionaries, frequency lists, and lexical and morphological databases. The SALMA-ABCLexicon is a lexical and morphological dictionary that was constructed using the TAL-Corpus text (see Section 3.3.1). It contains slightly under three million word-root pairs.
There are no mature standard criteria for evaluating newly constructed text corpora (Atkins et al., 1992). Therefore, our criteria for evaluating the TAL-Corpus should reflect the goal of its construction: we need the corpus to include the majority of standard Arabic vocabulary, and this vocabulary should be diverse and cover contemporary as well as classical words. Lexical diversity is defined by McCarthy and Jarvis (2010) as "the range of different words used in a text, with a greater range indicating a higher diversity". Lexical diversity (LD) is computed as the type-token ratio. The lexical diversity of the TAL-Corpus scored 0.152. It was evaluated by comparing it against the LD of rival Arabic corpora. The Arabic Web 2012 (arTenTen) corpus belongs to the TenTen corpora family, which was created by harvesting web pages using SpiderLing. It contains around 7.5 billion tokens which represent around 2 million word types. Its LD scored about 0.000263. Similarly, the Arabic Internet Corpus was developed by harvesting articles from webpages published in Arabic. It contains around 165 million tokens and more than 4 million different tokens; its LD scored 0.025965. The third corpus used in this comparative evaluation is the Arabic Wikipedia corpus (wiki-ar) 3. It contains around 16 million tokens and slightly less than 1 million types. The LD of this corpus scored 0.057.
Figure 11. The coverage percentage of the TAL-Corpus using the exact match method
Arabic is a morphologically rich language. Therefore, most Arabic words in context are complex words. Clitics and affixes are attached to words in context, which remarkably increases the number of different word forms. Clitics make matching against the lexical entries of the SALMA-ABCLexicon difficult; hence, the coverage percentage decreases. As an alternative, the coverage of the TAL-Corpus was computed by matching the lemmas of the SALMA-ABCLexicon with the lemmas of the three corpora.
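The lexical diversity measure used here is simply the number of word types divided by the number of tokens, which can be checked against the corpus counts reported in this paper:

```python
def lexical_diversity(n_types, n_tokens):
    """Type-token ratio: distinct word types divided by total tokens."""
    return n_types / n_tokens

# TAL-Corpus counts reported in the paper:
# 2 184 315 word types over 14 369 570 words.
print(round(lexical_diversity(2184315, 14369570), 3))  # 0.152
```

The same formula, applied to the token and type counts quoted for arTenTen, the Arabic Internet Corpus and wiki-ar, reproduces the much lower LD scores of those corpora.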
The SALMA-Lemmatizer (Sawalha, 2011) was used to lemmatize the three corpora and the lexical entries of the SALMA-ABCLexicon. The SALMA-Lemmatizer also includes a list of function words. The second part of this experiment excludes function words from the coverage calculations. Tables [9] and [10] show the coverage percentage of the TAL-Corpus computed by matching lemmas, including and excluding the function words respectively. Figure 12 shows a summary of the coverage of the TAL-Corpus based on matching lemmas. The evaluation experiments, which computed the coverage of the TAL-Corpus against three Arabic corpora, showed that it does not fully cover words belonging to four categories: (i) function words; (ii) new Arabic terms; (iii) relative nouns; and (iv) borrowed words. First, function words such as ‫ذلك‬ ḏālika "that"; ‫وإلى‬ wa-'ilā "and to"; ‫إنهم‬ 'innahum "they are"; and ‫التي‬ allatī "which" were not covered in the TAL-Corpus. These words can easily be added by including traditional Arabic grammar books in the corpus (Diwan 2004). Second, new Arabic terms such as ‫دردشة‬ dardašat "chat"; ‫انقر‬ 'unqur "click" and ‫االنتخابات‬ al-'intiẖābāt "elections" are not covered because these words have appeared recently due to recent technical and social developments. Unfortunately, modern Arabic dictionaries are not available in machine-readable format. Therefore, including these dictionaries in the TAL-Corpus requires retyping and reformatting them in a machine-readable format. Third, relative nouns ‫َالمنسوبة‬ ‫األسماء‬ al-'asmā' al-mansūbah are nouns that indicate the affiliation of something to these nouns. Relative nouns such as ‫السياحيّة‬ as-siyāḥyyat "tourism"; ‫االجتماعيّة‬ al-iǧtimāʿiyyat "social"; and ‫الثقافيّة‬ aṯ-ṯaqāfiyyat "cultural" have become widely used in the media and Modern Standard Arabic.
Annexing this group of words to the TAL-Corpus can be achieved by including modern Arabic dictionaries. Fourth, borrowed words such as ‫الدكتور‬ ad-duktūr "doctor"; ‫اإليميل‬ al-'imayl "e-mail"; ‫التليفون‬ at-tilifūn "telephone"; and ‫اإلنترنت‬ al-'intarnit "Internet" are foreign words transliterated into Arabic using Arabic letters. Borrowed words are frequently found in newspaper and web page text because of the lack of standard translations for them. However, the Arabic Language Academies (i.e. the organizations responsible for standardizing Arabic) are producing specialized dictionaries and word lists that translate these technical terms 6 into Arabic. These specialized dictionaries can be included in the TAL-Corpus to increase its coverage. Figure [13] shows a sample of common words which are not covered by the TAL-Corpus.
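The coverage measure used throughout this evaluation can be sketched as follows. This is an illustrative reconstruction with hypothetical miniature data, not the actual evaluation pipeline; the same function applies whether the items compared are surface words or lemmas.

```python
def coverage(corpus_items, lexicon_items):
    """Fraction of corpus items (words or lemmas) found in the lexicon."""
    lexicon = set(lexicon_items)
    matched = sum(1 for item in corpus_items if item in lexicon)
    return matched / len(corpus_items)

# Hypothetical miniature example: dardaša 'chat' is a modern term of the
# kind the evaluation found uncovered by the traditional lexicons.
corpus_words = ["كتب", "مكتوب", "دردشة", "الكتاب"]
lexicon_words = ["كتب", "مكتوب", "كاتب", "الكتاب"]
print(coverage(corpus_words, lexicon_words))  # 3 of 4 matched -> 0.75
```

Matching lemmas rather than raw words raises the score (about 67% vs. 82% in the reported experiments) because cliticized surface forms collapse onto the same lexicon entry.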

Potential Users and Uses
The purpose of constructing the TAL-Corpus was to provide a collection of traditional Arabic dictionaries that can be analysed, studied and used to create comprehensive language resources such as new Arabic dictionaries, frequency lists, collocates, and morphological dictionaries. The potential users of the TAL-Corpus are lexicographers, Arabic linguists, language learners and computational linguists. The following is a discussion of the potential uses for each expected user group of this corpus. Computational linguists: corpora are essentially used by computational linguists to build language models for machine learning algorithms. The TAL-Corpus could be used to build language models for Arabic morphological analysers, stemmers and lemmatizers, as well as language models for semantic analysis of Arabic. Computational linguists can build tracking programs that investigate the development of Arabic vocabulary and the changes in word meanings. The TAL-Corpus includes traditional Arabic dictionaries from a period spanning more than 1 200 years, which enables tracking the development and changes of meaning of Arabic vocabulary. In conclusion, the TAL-Corpus is an essential resource for extracting useful information that supports a wide variety of Arabic NLP applications such as root extraction applications, morphological analysers, semantic networks of Arabic vocabulary, WordNets, and ontologies.

Discussion of the Results, Limitations and Improvement
The TAL-Corpus is constructed from the text of traditional Arabic dictionaries. It is characterized by a wide coverage of Arabic words, word types and roots. The evaluation showed that the TAL-Corpus covers about 85% of the test corpora words. Despite the 13-century time span of the traditional Arabic lexicons from which the TAL-Corpus has been derived, only 15% of the test corpora words were not captured. The latest Arabic dictionary included in the TAL-Corpus is ‫َالوسيط‬ ‫المعجم‬ al-mu'ğam al-wasῑṭ, which appeared in the 1960s.
Hence, new vocabulary items added to Arabic in the past 50 years are not covered in the TAL-Corpus. Moreover, due to advances in telecommunication and information technology, globalization, and the wide and intensive use of social networks, foreign words have been increasingly used in both spoken and written Arabic. These foreign words do not have proper translations into Arabic, but are written using Arabic letters (i.e. transliterated). Advances in telecommunication and information technology also mean that new products have entered Arab countries with their original names. These products keep their original names, which have been widely used and have become part of the contemporary Arabic vocabulary. Moreover, the use of dialectal Arabic has increased in both written and spoken forms due to open systems such as chat rooms, blogs, forums and social networks, which allow people to write text without restrictions.
The TAL-Corpus was used to construct a broad-coverage morphological database, the SALMA-ABCLexicon. This database did not undergo any manual correction due to limitations in funding. However, an automatic correction and verification procedure was applied to part of the database. The verification procedure counted how many times each word-root pair appears in the analyzed traditional Arabic dictionaries. 976 427 word-root pairs, representing 35.19% of the lexicon's word-root pairs, scored a count of 2 or more. This means that these word-root pairs appeared in different dictionaries and therefore have a high potential to be valid and correct.
This is the first version of the SALMA-ABCLexicon. It can be extended to include the full morphological analyses of the lexical entries and other useful information that will enhance the performance of NLP applications. Special linguistic lists such as compounds, collocations, idiomatic phrases, phrasal verbs and named entities can be added to extend the lexicon. Moreover, morphological lists such as broken plurals, intransitive and transitive verbs, rational and irrational words and primitive nouns can be another extension to the lexicon. The SALMA-ABCLexicon can also be extended by adding modern and dialect vocabularies from newly constructed Arabic corpora and the web.

Conclusions
The Corpus of Traditional Arabic Lexicons (the TAL-Corpus) is a special corpus constructed from the text of 23 traditional Arabic dictionaries spanning a period of over 1 200 years. The corpus contains 14 369 570 words and 2 184 315 word types. The motivation for building the TAL-Corpus is to collect and organize the well-established and long traditions of traditional Arabic lexicons. The TAL-Corpus can also be used to construct new corpus-based Arabic dictionaries. Corpora have not yet been used to construct Arabic dictionaries and lexical databases. Therefore, building corpora for the purpose of constructing new Arabic dictionaries is needed.
Thousands of traditional Arabic dictionaries were constructed over the past 1 200 years. These dictionaries differ in size, type and the ordering of their lexical entries. This wide variety of traditional Arabic dictionaries represents a rich base for building a corpus that can be further used and exploited to construct new corpus-based Arabic dictionaries.
The construction of the TAL-Corpus followed standard design and development criteria that informed major decisions in corpus creation. The text of the TAL-Corpus is composed of the text of 23 freely available, machine-readable traditional Arabic dictionaries. These dictionaries were processed into a unified format, based on arranging the contents of the corpus by roots (i.e. the head words of the majority of traditional Arabic dictionaries) and their definitions. Then, the SALMA-root extractor and lemmatizer were used to tokenize the text, strip diacritics, and extract the root and lemma of each word in the corpus. Frequency lists of both vowelized and non-vowelized words were also generated.
The SALMA-ABCLexicon was constructed by analysing the TAL-Corpus text. First, linguistic rules encoded in a specialized program were applied to extract each root and the words derived from that root. Second, a combination algorithm merged the extracted information into one large broad-coverage lexical database. The SALMA-ABCLexicon contains 2 781 796 vowelized word-root pairs which represent 509 506 different non-vowelized words.
The TAL-Corpus is stored and distributed using XML technology. The corpus XML files contain all markups which indicate the corpus features. The choice of using XML technology is to facilitate the distribution and the use of the corpus. The TAL-Corpus is an open-source resource which is licenced under a Creative Commons Attribution-NonCommercial 4.0 International Licence.
The evaluation of the TAL-Corpus was done by computing its coverage over three Arabic corpora: the Corpus of Contemporary Arabic, the Qur'an text, and the Arabic Internet Corpus. The coverage was computed by matching the words of the test corpora to the words in the SALMA-ABCLexicon, which scored about 67%. A lemmatizer program was then used to compute the coverage by matching the lemmas of the test corpora to the lemmas of the SALMA-ABCLexicon; this method scored a coverage of about 82%.
The potential users of the TAL-Corpus are lexicographers, Arabic linguists, language learners and computational linguists. Its potential uses are to provide a collection of traditional Arabic dictionaries that can be analysed, studied and used to create comprehensive language resources such as new Arabic dictionaries, frequency lists, collocates, and morphological dictionaries.