Patterns of Codon Usage Bias in WRKY Genes of Brassica rapa and Arabidopsis thaliana

Codon usage bias (CUB) is the selective, nonrandom use of synonymous codons by an organism to encode amino acids. WRKY is one of the important plant transcription factor families, and its role in regulating abiotic and biotic stress responses in plants has been investigated. In this paper, the codon usage pattern of the WRKY transcription factor genes of two important plant species, Arabidopsis thaliana and Brassica rapa, was investigated. Several codon usage analyses were carried out, including ENc, CAI, correspondence analysis, RSCU analysis, neutrality plot and hierarchical clustering. The status of G- and C-ending codons was high in Arabidopsis. The RSCU analysis revealed that codons coding for arginine had the maximum RSCU values in both plant species. Our results suggest that natural selection was the dominant factor guiding the evolution of the different WRKY genes in both Arabidopsis thaliana and Brassica rapa.


Introduction
The genetic code is the set of codons used by living cells to convert the information encoded in DNA into proteins. Codons that code for the same amino acid are termed synonymous codons. In the absence of bias, all synonymous codons for an amino acid would be used at the same frequency; however, unequal usage frequencies of different synonymous codons have been observed. This unequal usage of synonymous codons is termed codon usage bias (CUB). The differences in codon usage arise from variation in the choice between codons ending with C or G and those ending with A or T (Campbell & Gouri, 1990). CUB sheds light on the origin of genes and species and on the mutational forces acting on them (Wu et al., 2017), and it also helps in predicting the functions of related genes and the structure and expression of proteins (Song et al., 2015; Zhao et al., 2016). The pattern of codon usage depends on mutation pressure, natural selection and the amino acid sequence (Mandlik et al., 2014). The two widely accepted theories of CUB are the neutral theory and the selection-mutation-drift theory (Bulmer, 1991). The neutral theory states that mutations at degenerate coding positions are neutral and that synonymous codon choice is random. The selection-mutation-drift model states that codon bias results from a balance between selection on the one hand and mutation pressure and genetic drift on the other (Yang et al., 2015). Other factors may also affect CUB within a species. These include gene length (Duret & Mouchiroud, 1999), gene expression level (Hambuch & Parsch, 2005), GC content (Hu et al., 2007), environmental stress (Goodarzi et al., 2008), RNA stability (Akashi, 1997), population size, recombination rate and codon position (Behura & Severson, 2012).

Over the past few years, with the advent of DNA sequencing technologies, a large number of plant genes have been sequenced and deposited in databases. This has led to increased study of the CUB patterns of different plants and their genes (Gustafsson et al., 2004). Variation in CUB is a characteristic feature of genomes and of specific genes within a species (Supek & Vlahovicek, 2005). Differences in codon usage have also been found among genes within a species. In studies on monocot and dicot species, increased expression was observed for genes whose third codon position had been modified to G or C. These results point to a relationship between translation efficiency and codon bias in monocots and dicots (Kawabe & Miyashita, 2003). Codon usage reflects specific patterns of gene expression, and genes expressed under the same physiological state have been noted to prefer the same codons (Chiapello et al., 1998).

Plants are sessile and are therefore subjected to a wide variety of stresses simultaneously. The physiological and biochemical adaptations that plants developed during the course of evolution provide an advantage in combating a single stress but not multiple stresses. External stimuli are sensed by receptors on the cell membrane, which trigger a chain of molecular and biochemical reactions. WRKY proteins are a large family of regulatory proteins that regulate diverse responses to biotic and abiotic stresses through a complicated network of genes (Smith, 2000). WRKY transcription factors regulate gene expression by binding to W-box and W-box-like sequences, leading to activation or repression of their target genes. The binding of WRKY proteins to these regulatory elements is tightly regulated, and as a result they have become a target for crop improvement (Phukan et al., 2016). A possible cause of the expansion of the WRKY transcription factor family during the course of evolution is exposure to a number of biotic and abiotic stresses (Eulgem et al., 2000). The present study analyzes the codon bias and base composition of WRKY transcription factor genes of Brassica rapa and Arabidopsis thaliana using different codon usage indices. This study will improve our understanding of the codon usage pattern of WRKY genes as well as their pattern of evolution and function in the two plant species.

Gene Sequence
Complete coding DNA sequences (CDS) of WRKY genes of Brassica rapa and Arabidopsis thaliana were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/). Each sequence was checked for the presence of a start codon at the beginning and a stop codon at the end, and for the absence of stop codons within the coding frame. Only sequences longer than 300 base pairs were considered in the study.
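The filtering criteria above can be sketched in Python as follows. This is a minimal illustration and not the pipeline used in the study: the function name `is_valid_cds`, the use of the standard genetic code (ATG start; TAA, TAG and TGA stops) and the additional requirement that the length be a multiple of three are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def is_valid_cds(seq, min_length=300):
    """Return True if seq passes the checks described in the text.

    Checks: longer than min_length bp, starts with ATG, ends with a stop
    codon, and has no stop codon inside the reading frame. Requiring a
    length divisible by 3 is an extra assumption of this sketch.
    """
    seq = seq.upper().replace("U", "T")
    if len(seq) <= min_length or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG":             # must begin with a start codon
        return False
    if codons[-1] not in STOP_CODONS:  # must end with a stop codon
        return False
    # No premature stop codon between the start and the final stop.
    return not any(c in STOP_CODONS for c in codons[1:-1])
```

For example, a hypothetical 306 bp sequence `"ATG" + "GCT" * 100 + "TAA"` passes, whereas the same sequence with an in-frame `TGA` in the middle is rejected.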

Analysis of the Base Composition
The program CAIcal (genomes.urv.es/CAIcal/) was used to calculate the GC content at the first, second and third codon positions (GC1, GC2 and GC3, respectively). GC12 is the average of the GC content at the first and second codon positions. The values of GC1, GC2, GC3 and GC12 for Arabidopsis thaliana and Brassica rapa are given in Tables 1 and 2, respectively.
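As an illustration of these quantities, the sketch below computes positional GC content directly from a coding sequence. It is not CAIcal itself: the function name `gc_by_position` is hypothetical, and the sketch counts every codon (including the start and stop codons), so its output may differ slightly from values reported by CAIcal.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3, GC12) as percentages for one coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    n_codons = usable // 3
    gc = []
    for pos in range(3):
        bases = cds[pos:usable:3]      # every base at this codon position
        gc.append(100.0 * sum(b in "GC" for b in bases) / n_codons)
    gc1, gc2, gc3 = gc
    # GC12 is the mean of the first- and second-position GC content.
    return gc1, gc2, gc3, (gc1 + gc2) / 2.0
```

For the toy sequence `ATGGCCTAA` (codons ATG, GCC, TAA), one of three bases is G or C at each of the first two positions and two of three at the third, giving GC1 = GC2 = GC12 ≈ 33.3% and GC3 ≈ 66.7%.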

RSCU Analysis
The relative synonymous codon usage (RSCU) value of a codon i is the ratio between the observed usage frequency of that codon in a gene and its expected usage frequency within the synonymous codon family. It is represented as $\mathrm{RSCU}_i = \mathrm{obs}_i / \mathrm{exp}_i$, where $\mathrm{obs}_i$ is the observed number of occurrences of codon i and $\mathrm{exp}_i$ is the expected number of occurrences of the same codon, given that all codons for the particular amino acid are used equally (based on the number of times the relevant amino acid is present in the gene and the number of synonymous alternatives to i). RSCU analysis assesses the pattern of synonymous codon usage while discounting the influence of amino acid composition. This index reflects the relative usage preference for specific codons encoding the same amino acid (Wang et al., 2016). The RSCU values of the WRKY sequences of Arabidopsis thaliana and Brassica rapa were calculated using MEGA 7 software, excluding the stop codons and the codons for amino acids encoded by a single codon. An RSCU value > 1 indicates positive CUB, a value < 1 indicates negative CUB, and a value of 1 indicates no CUB (Sharp & Li, 1986). An RSCU value > 1.6 indicates a strongly preferred codon.
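The definition above can be written out as a short Python sketch. It is an illustration under stated assumptions rather than the MEGA 7 implementation: the standard genetic code is assumed, the names `CODON_TABLE`, `FAMILIES` and `rscu` are hypothetical, and codons of an amino acid that is absent from the gene are assigned 0.0, which is a convention chosen for this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families, excluding stop codons, Met (ATG) and Trp (TGG).
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "*MW":
        FAMILIES[aa].append(codon)


def rscu(cds):
    """Return {codon: RSCU} for the 59 degenerate sense codons of one CDS."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    values = {}
    for codons in FAMILIES.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # exp_i = total / family size, so RSCU_i = obs_i * size / total.
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values
```

As a worked example, in the toy sequence `AGAAGAAGACGT` arginine occurs four times across a six-codon family, so the expected count per codon is 4/6; AGA (observed three times) has RSCU = 3 × 6 / 4 = 4.5 and CGT (observed once) has RSCU = 1.5.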

ENc Analysis
The CUB present in a gene was quantified using the effective number of codons (ENc). ENc ranges from 20 (extreme bias, where only one codon is used for each amino acid) to 61 (random usage of all synonymous codons). ENc was calculated with the program at genomes.urv.es/CAIcal/. The ribosomal protein genes of Arabidopsis thaliana and Brassica rapa were used as a reference set.

Codon Adaptation Index Analysis
The Codon Adaptation Index (CAI) is another widely used measure of CUB. It measures the deviation of a given protein-coding gene sequence from a reference set of genes. The program at genomes.urv.es/CAIcal/ was used for the CAI calculation (Puigbò et al., 2008). CAI values range from 0 to 1. The synonymous codon usage pattern of ribosomal genes was used as the reference set. To ascertain the relative influence of mutation and selection, a correlation analysis between CAI and ENc values was carried out. If selection dominates over mutation, the correlation (r) between the two quantities should be strongly negative (r → -1). In contrast, if mutational force is more important, r approaches 0 (no correlation) (Chen et al., 2014).
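The CAI calculation and the CAI-ENc correlation can be sketched as follows. This is a simplified illustration, not the CAIcal implementation: it assumes the standard genetic code, uses the usual definition of CAI as the geometric mean of relative adaptiveness values, and simply skips codons that never occur in the reference set (an assumption of this sketch). The names `relative_adaptiveness`, `cai` and `pearson_r` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_of(cds):
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def relative_adaptiveness(reference_cds):
    """w = count of a codon / count of the most used synonym in the reference."""
    counts = Counter()
    for cds in reference_cds:
        counts.update(codons_of(cds))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":            # Met, Trp and stops carry no information
            families[aa].append(codon)
    weights = {}
    for codons in families.values():
        top = max(counts[c] for c in codons)
        for c in codons:
            if top and counts[c]:      # unseen codons get no weight (assumption)
                weights[c] = counts[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of w over the codons of cds that have a weight."""
    logs = [math.log(weights[c]) for c in codons_of(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def pearson_r(xs, ys):
    """Pearson correlation coefficient, e.g. between CAI and ENc values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

With a toy reference set `["GCTGCTGCTGCC"]`, GCT receives w = 1 and GCC receives w = 1/3, so `cai("GCTGCT", w)` is 1.0 while `cai("GCTGCC", w)` is the square root of 1/3, about 0.577.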

ENc and GC3 Plot
ENc versus GC3 plots were used to analyze whether codon usage in a set of genes is shaped by mutation, by selection or by other factors. Genes that lie on or close to the standard curve have codon usage shaped mainly by mutation, whereas genes that fall below the curve are subject to selection pressure on codon usage. The formulas used to calculate the actual and expected ENc values were described by Wright (1990). The actual ENc value is $\mathrm{ENc} = 2 + 9/F_2 + 1/F_3 + 5/F_4 + 3/F_6$, where $F_2$, $F_3$, $F_4$ and $F_6$ are the average homozygosities of the codon families with 2, 3, 4 and 6 synonymous codons, respectively. The expected ENc value was calculated as $\mathrm{ENc}_{\mathrm{exp}} = 2 + \mathrm{GC3} + 29/[\mathrm{GC3}^2 + (1 - \mathrm{GC3})^2]$, with GC3 expressed as a fraction.
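Both formulas can be sketched in Python as shown below. This is an illustrative sketch, not the CAIcal implementation: it assumes the standard genetic code, estimates the homozygosity of an amino acid with n codons as F = (n Σp² − 1)/(n − 1), skips amino acids that occur fewer than twice, and raises an error when a whole degeneracy class is missing instead of applying a correction. The names `expected_enc` and `observed_enc` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families grouped by amino acid, stop codons excluded.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def expected_enc(gc3):
    """Expected ENc under mutation pressure alone; gc3 is a fraction in [0, 1]."""
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def observed_enc(cds):
    """ENc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 for one coding sequence."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    f_by_class = defaultdict(list)
    for codons in FAMILIES.values():
        degeneracy = len(codons)
        n = sum(counts[c] for c in codons)
        if degeneracy == 1 or n < 2:   # Met, Trp, or too few observations
            continue
        homozygosity = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if homozygosity > 0:
            f_by_class[degeneracy].append(homozygosity)
    enc = 2.0
    for degeneracy, n_families in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class[degeneracy]
        if not values:
            raise ValueError(f"no usable amino acid with {degeneracy} codons")
        enc += n_families / (sum(values) / len(values))
    return min(enc, 61.0)              # ENc is capped at 61
```

As a check of the two extremes, `expected_enc(0.5)` gives 2 + 0.5 + 29/0.5 = 60.5, and a sequence that uses a single codon for every amino acid has all homozygosities equal to 1, giving the minimum ENc of 2 + 9 + 1 + 5 + 3 = 20.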

Multivariate Statistical Analysis for RSCU Values of WRKY Genes
Correspondence analysis (Greenacre, 1984), one of the most commonly used statistical approaches for analyzing synonymous CUB, was implemented in NCSS 12 software. Correspondence analysis (COA) was applied to RSCU values to investigate the trend in codon usage among the WRKY genes of the two plant species, Arabidopsis thaliana and Brassica rapa. To minimize the effect of amino acid composition on codon usage, each WRKY gene of the two plant species was represented as a vector in a 59-dimensional space, in which each dimension corresponds to the RSCU value of one sense codon. The three termination codons and the codons of methionine and tryptophan were excluded. The trend of variation among the genes can be observed from the measures of relative inertia. WRKY genes were finally ordered according to their position along the axis of major inertia.

Hierarchical Clustering of WRKY Genes
We used the 59 RSCU values of each WRKY gene from the two plant species for their systematic classification. Clustering is a technique that groups data points so that points within a cluster are close to each other and far from points in other clusters. The heat map and phylogenetic tree of the different WRKY genes of the two species were constructed using the clustered heat map method of NCSS 12 software. The relationship of RSCU with the phylogeny was further analyzed. Each WRKY coding sequence was initially treated as a separate class; then, according to the distances between the sequences, the two classes with the minimum distance were merged into a single class.
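The merging procedure described above can be sketched in plain Python. This is only an illustration and not the NCSS 12 clustered heat map method: the text does not state the distance metric or linkage rule, so Euclidean distance and single linkage are assumptions here, and the gene names and two-dimensional vectors are hypothetical stand-ins for 59-dimensional RSCU vectors.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors):
    """Return the merge order as (members_a, members_b, distance) tuples."""
    clusters = [(name,) for name in vectors]   # every gene starts as its own class
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumed): distance of the closest member pair.
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy data; real input would be one 59-value RSCU vector per gene.
toy = {"geneA": [1.0, 1.0], "geneB": [1.1, 0.9], "geneC": [3.0, 0.2]}
merges = agglomerate(toy)
```

In this toy example geneA and geneB are closest and are merged first; geneC then joins the combined class in the second and final merge.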

Neutrality Plot
The two main factors affecting CUB are mutational pressure and natural selection. The extent to which mutational pressure affects CUB relative to selection pressure was determined by neutrality plot analysis (Sueoka, 1988). Synonymous mutations generally occur at the third position of the codon, but mutations may also occur at the first and second positions, resulting in non-synonymous codons. GC3s was plotted on the X-axis and GC12 on the Y-axis, and the slope of the regression line was taken as the mutation-selection equilibrium coefficient. The regression line measures the degree of neutrality: a regression line that falls near the diagonal (slope = 1) indicates weak selection pressure on CUB, whereas deviation of the regression line from the diagonal indicates a large influence of natural selection on CUB (Kumar et al., 2016).
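The regression step of the neutrality plot can be sketched as an ordinary least-squares fit of GC12 on GC3. The function name `neutrality_fit` and the data in the usage note are hypothetical; the sketch only shows how the slope is obtained and is not the software used in the study.

```python
def neutrality_fit(gc3, gc12):
    """Least-squares slope and intercept of GC12 (Y-axis) on GC3 (X-axis)."""
    n = len(gc3)
    if n < 2 or n != len(gc12):
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    if sxx == 0:
        raise ValueError("GC3 values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx
    # A slope near 1 suggests mutation pressure dominates; a slope near 0
    # suggests that selection constrains GC12 independently of GC3.
    return slope, mean_y - slope * mean_x
```

For hypothetical points lying exactly on GC12 = 0.5 × GC3 + 20, such as GC3 = [30, 40, 50, 60] and GC12 = [35, 40, 45, 50], the function returns a slope of 0.5 and an intercept of 20.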

Analysis of the Base Composition
The AT and GC content of a gene plays an important role in gene organization. GC-rich plant genes help the plant respond to environmental stress (Tatarinova et al., 2010). Important features of the GC base pair are its higher mutability, because cytosine is frequently methylated (Coulondre et al., 1978; Ossowski et al., 2010), and its higher cost of synthesis compared with the AT pair. In the present study, the AT content was higher than the GC content in each of the WRKY gene CDS examined in both plant species, although optimal codons were mostly found to end with G/C. Song et al. (2015) analyzed the WRKY genes of Glycine max and likewise reported higher AT content than GC content. AT content ranged from 50% to 65% in Arabidopsis thaliana and from 50% to 58% in Brassica rapa. The G+C content differed noticeably among the three codon positions: GC1 was higher than GC2, and GC3 was the lowest of the three positions in the WRKY genes of both plants (Tables 1 and 2). The low GC3s content in Arabidopsis genes is consistent with previous reports (Tatarinova et al., 2010). Comparison of the G- and C-ending status of codons of the WRKY genes of Brassica rapa and Arabidopsis thaliana showed that the status of G- and C-ending codons was higher in Arabidopsis thaliana genes than in Brassica rapa genes. A very high G- and C-ending status was present in some of the WRKY genes of the two species. Of the 70 WRKY genes analyzed from both species, 15 WRKY genes of Brassica rapa and 30 WRKY genes of Arabidopsis thaliana showed more frequent G- and C-ending codons. Codons ending with G and C are important in determining protein function; for example, the codon usage of genes encoding regulatory proteins, namely transposases, kinases, transcription factors and phosphatases, is biased towards G- and C-ending codons (Fennoy & Bailey-Serres, 1993). The status of CG and TA doublet codons was higher in Brassica rapa WRKY genes 1, 2, 8, 9, 10, 13, 15, 18, 21, 25, 32, 33, 44, 46, 47, 48, 53, 54, 56, 58, 59, 62, 64, 69 and 70. The doublet TA is the least preferred combination at the second and third codon positions in most eukaryotes. In our study, codons ending with TA were preferred in some of the WRKY genes of Arabidopsis thaliana, namely WRKY 15, 16, 19, 21 and 27 (Kumar & Sharma, 1995).

RSCU Analysis
The maximum RSCU value of 5 was found for the codon UUU, coding for phenylalanine, in Brassica rapa, and a value of 4.5 was found for the codon AGA in Arabidopsis thaliana. Phenylalanine acts as a building block of proteins and is also involved in the synthesis of secondary metabolites that help in plant defense (Tzin & Galili, 2010). Codons coding for arginine had the maximum RSCU values in both species, with RSCU values > 2. The codon AGA was preferred over the AGG, CGG, CGA and CGC codons coding for arginine in most of the WRKY genes observed in Arabidopsis thaliana and Brassica rapa. Thirty Brassica rapa WRKY genes and twenty Arabidopsis thaliana WRKY genes possessed these higher RSCU values. The number of preferred codons differed among the WRKY genes of both species. In Arabidopsis thaliana, the maximum number of preferred codons is twenty-six and the minimum is twelve. WRKY 57 has the maximum number of preferred
codon i.e. and the mi

ENc Analysis
The ENc v Brassica r reveals tha 33 in both rapa.Whi 55 showed

CAI Analysis
After com out that th namely W as compar from 0.5 t genes are a including 1980; Gou

ENc and GC3 Plot
The

Neutrality Plot
The neutrality plot was drawn with GC12 on the Y-axis and GC3 on the X-axis. Each dot represents one WRKY gene of the particular species. This plot helps identify the key factors that shaped codon usage (Kumar et al., 2016). When the slope of the regression line is equal to 1, GC12 and GC3 are perfectly correlated and mutation pressure is the dominant factor resulting in CUB. In Figure 3A, the neutrality plot of the WRKY gene CDS of Arabidopsis thaliana shows a negative slope of the regression line (-0.08), while the slope was close to zero in Brassica rapa (0.03). A slope of 0.03 indicates that the influence of direct mutation pressure on CUB is only 3%, while the impact of natural selection was calculated to be 97%. The genes showed a low level of mutation bias, and natural selection can be regarded as the dominant force shaping the codon usage pattern of the WRKY genes of Brassica rapa and Arabidopsis thaliana. Highly expressed genes, such as those encoding translation elongation factors and ribosomal proteins, have natural selection acting on them to ensure efficient translation (Hershberg & Petrov, 2008).

Effect of Gene Expression Level and Protein Length on Codon Usage Bias
Figures 4A and 4B and Figures 5A and 5B show significant negative correlations between CAI and GC3 in Arabidopsis thaliana (R² = 0.3, p < 0.05, slope = -0.4) and Brassica rapa (R² = 0.19, p < 0.05, slope = -0.5) WRKY genes, respectively. In Figures 6A and 6B, a positive correlation was observed between protein length and ENc (R² = 0.004, p > 0.05, slope = 0.003) in Arabidopsis thaliana and Brassica rapa WRKY genes, respectively. Because this correlation is very weak and not statistically significant, protein length appears to have at most a minor effect on codon usage bias in the WRKY genes of the two plant species.

Correspondence Analysis
For

Discussion
Our analysis of codon usage patterns among the WRKY genes of Arabidopsis thaliana and Brassica rapa reveals that most of the WRKY genes in both species have higher AT content than GC content. The reason might be the small size of the protein sequences, as shorter sequences have a higher AT bias (Wuitschick & Karrer, 1999). The G+C content was lowest at the third codon position (GC3), while the status of G- and C-ending codons was found to be high in some of the WRKY genes of Arabidopsis thaliana and Brassica rapa. These results suggest that selection has driven the codon usage of genes with important functions towards high GC content. Many factors affect the codon usage pattern among and within species. In highly expressed genes, codons that improve translational efficiency are selected, while in genes with a low level of expression, mutation and drift determine codon usage (Bulmer, 1991). In this study, the ENc versus GC3 plot of the WRKY genes of Arabidopsis thaliana and Brassica rapa showed that natural selection is the major influence on the codon usage pattern. Similar results for the WRKY genes of Medicago truncatula were obtained by Song et al. (2015).

To examine the relationship between codon bias and gene expression level, the expression levels of the genes must be known. In eukaryotes, gene expression level is difficult to determine because expression differs between times and tissues. In our study, we used CAI to evaluate the expression level of WRKY genes. CAI is considered a well-accepted measure of gene expression (Naya et al., 2001; Gupta et al., 2004). The hierarchical clustering grouped WRKY genes with similar RSCU values within the same cluster, and genes present in the same cluster showed similar functions. The codon AGA, coding for arginine, had an RSCU value greater than 2 in all the WRKY gene sequences analyzed. The preferential use of arginine codons can be linked to the oxidative stress response. A study in budding yeast revealed that reprogramming of tRNA leads to the expression of codon-biased mRNAs coding for arginine under oxidative stress conditions (Gu et al., 2014). Arginine codons also play a role in the evolution and variability of Hepatitis A virus strains (Andrea et al., 2011). Our study provides insights into the synonymous codon usage pattern of WRKY genes in Arabidopsis thaliana and Brassica rapa. Comparison of the RSCU values of different amino acids in Arabidopsis thaliana and Brassica rapa WRKY genes showed that some WRKY genes differ only slightly in codon usage and hence can be used in transgenic studies.

Conclusion
To understand the forces responsible for the CUB of WRKY transcription factor genes in two related species, Arabidopsis thaliana and Brassica rapa, we examined the WRKY coding sequences of both plant species available in the database. Different indices that help characterize CUB were calculated. On the basis of CAI values, WRKY genes were found to be highly expressed genes that have higher AT content than GC content and show a moderate level of bias in both plant species. Natural selection acts as the main force shaping the codon usage pattern. WRKY genes with similar RSCU values were found to share similar functions. A positive correlation was seen between the coding sequence length of WRKY genes and the effective number of codons.

observed that G-ending codons for threonine, alanine, proline and serine are avoided by B. napus, B. oleracea and B. campestris. The WRKY genes of Brassica rapa also avoided G-ending codons for threonine, proline, serine and alanine; instead, preference is given to codons ending with A, U and C.

Table 1. Gene length, CAI value, percent GC content at the first, second and third codon positions, and ENc value of Arabidopsis thaliana WRKY genes

Table 2. Gene length, CAI value, percent GC content at the first, second and third codon positions, and ENc value of Brassica rapa WRKY genes