The Power-Law-Tail in the Distribution of the Nucleotides of Genomes Was Related to the Complexity of Organism: New Classification of Organisms

We propose a new index for the classification of organisms (cells) based on the appearance frequency of the four nucleotides (bases) in various genomes. In a double logarithmic plot of L (the distance from a base to the next occurrence of the same base, x-axis) vs F (the frequency of the base at L, y-axis), each of the four bases in a single strand of DNA followed y = ae^{-bx} at L = 1 ~ 15 and y = Ux + W (the power-law-tail) at L-values of 16 bases or more. The a-, b- and U-values (slopes) of the four bases depended on the GC-content (%) and the size (nt) of the genome. Moreover, within one organism each value was identical between A and T, and between G and C. The power-law-tail appears to be characteristic of the genomes of each species, in both the eukaryotes and the prokaryotes. Compared with the prokaryotic genomes, the eukaryotic genomes were composed of a great number of bases and showed multiple long power-law-tail regions. In the prokaryotes, the base-distribution was partitioned at L = 20, and the U-values (base-distribution in the power-law-tail region) of the archaea were closer to those of the eukaryotes than were those of the eubacteria. Thus, the power-law-tail of the genomic DNA should derive from the structural features of the cells, i.e., the size, the GC-content and other characteristics of the genomic DNA. These results indicate that the power-law-tail of an individual genome reflects the complexity of the organism and might serve as a new index for classifying cells.


Introduction
Recently, with the rapid progress of genome projects for many prokaryotic and eukaryotic cells (NCBI genome database, 2013; The Sanger Institute, 2013; KEGG Organisms, 2013), whole-genome analyses of biological phenomena in individual cells have been carried out to understand organisms (Coleman et al., 2006; Eisen et al., 2006; Perfrey et al., 2006; Zmasek & Godzik, 2011).
The four bases in the genomic DNA are arranged in a sophisticated manner in all cells, and the coding and the non-coding regions of the genomic DNA are clearly distinguished. We have analyzed the entire genome, containing both the coding and the non-coding sequences, because the key sequences are distributed throughout the genome (Takeda & Nakahara, 2009, 2013; Takeda, 2009, 2011, 2012; Nakahara & Takeda, 2010a, 2010b). In other words, all genes, together with the non-coding base sequences of the genomic DNA, are maintained in the cells, and each gene in the genome can be expressed in the living cell.
As described above, studies on the appearance frequency of base sequences have been very useful in clarifying the structural features of the genomic DNA: (1) the reverse-complement symmetry, (2) the bias of the distribution of the four bases, and (3) the multiple fractality of the distribution of each of the four bases, which co-exist even in a single strand of genomic DNA (Takeda & Nakahara, 2009; Takeda, 2009, 2011). These structural features have been used to characterize or identify the interactive regions between chromosome and gene, protein and protein, protein and DNA, and protein and tRNA from the structure (the base sequence) of the entire genome or chromosome (Nakahara & Takeda, 2010a, 2010b; Takeda, 2012; Takeda & Nakahara, 2013).
In the early 1990s, many physicists interested in biology reported that the four bases, A, T, G and C, were placed with self-similarity (fractality) in genome base sequences (Peng et al., 1992; Voss, 1992; Bains, 1993; Weinberger & Stadler, 1993). At that time, however, few complete genome or chromosome sequences were available. These reports were therefore based on virus genomes, on parts of prokaryotic or small eukaryotic chromosomes, or on model genomes (chromosomes) built on earlier speculation or preliminary data. At present, the base sequences of the genomic DNA of many cells, both eukaryotic and prokaryotic, can be examined directly, and the cells can be viewed through the base sequences of their genomic DNA.
In 1992, the base sequence of chromosome III of S. cerevisiae was the first eukaryotic chromosome sequence to be determined (Oliver et al., 1992). Since then, the genome projects of S. cerevisiae (Mews et al., 1997), H. sapiens (Dunham et al., 1999) and many other eukaryotic and prokaryotic cells have accelerated, and their base sequences have been reported (NCBI genome database, 2013; The Sanger Institute, 2013; KEGG Organisms, 2013; Saccharomyces Genome Database, 2013). These results made detailed analyses of genome base sequences possible. The genomes of eukaryotic cells possess large non-coding sequences. These non-coding sequences are composed of many regulatory elements that might be necessary to express the genetic information precisely, rapidly and steadily (Takeda & Nakahara, 2009; Takeda, 2009; Takeda, 2012; Takeda & Nakahara, 2013).
The distribution of the four bases in genomic DNA is generally arranged with symmetry, bias and multiple fractality with different fractal dimensions. In a double logarithmic plot of the base-distribution, this appears as an exponentially decreasing phase at short distances and a linearly decreasing phase (a power-law-tail; the fractality with the multi-fractal dimension, expressed as y = Ux + W) at long distances (Takeda & Nakahara, 2009; Takeda, 2009; Takeda, 2011). The exponentially decreasing phase with uni-fractal dimensions (Barthelemy et al., 2000; Ordemann et al., 2000; Yu et al., 2004) at short distances, expressed as y = ae^{-bx}, was observed not only in every genome or chromosome, but also in the corresponding artificial genomes or chromosomes generated with the same size (nt) and the same molar contents of the four bases as each single strand of the genomic DNA. Thus, the uni-fractality expressed as y = ae^{-bx} is a general phenomenon in the arrangement of the four bases, in genomic DNA and artificial DNA alike. Furthermore, the artificial chromosomal sequences had only the reverse-complement symmetry and not the base bias, and we could not find any Open Reading Frames (ORFs) in the artificial chromosomes (Takeda & Nakahara, 2009; Takeda, 2009; Takeda, 2011; Takeda, 2012).
By contrast, the linearly decreasing phase (the power-law-tail) observed at long distances was unique and specific to each real (active) genome in the behavior of the base-distribution of each of the four bases. The power-law-tail could be observed only in active genomes or chromosomes, even those of viruses and organelles, but never in the base-distribution of artificial genomes or chromosomes; i.e., each gene in the living cell could be expressed only when all the structural features of a single strand of the genomic DNA were present (Takeda & Nakahara, 2009, 2013; Takeda, 2009, 2011, 2012).
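The artificial-genome control described above can be illustrated with a minimal sketch. The generation procedure is not specified in this text, so the code assumes one simple option, a random shuffle of the real sequence, which preserves both the length and the exact counts of the four bases. The function name `shuffled_control` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def shuffled_control(sequence, seed=0):
    """Return a random permutation of `sequence`.

    Assumption: shuffling is used here as one way to build an artificial
    sequence with the same length and base composition as the original.
    """
    bases = list(sequence)
    random.Random(seed).shuffle(bases)  # seeded for reproducibility
    return "".join(bases)


# Toy sequence (hypothetical, not genomic data)
real = "AATGACCATTGGCA"
artificial = shuffled_control(real)
print(Counter(real) == Counter(artificial))  # composition is preserved
```

Comparing the base-distribution of the real and the shuffled sequence is then a direct way to ask whether a long-distance feature survives the loss of base order.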
The effect of the base sequences adjacent to a gene has been observed in living cells. In eukaryotic cells, the non-coding sequences (the regulatory elements) occupy 2 ~ 98% of the entire genomic DNA (Mattick, 2004; Taft et al., 2007), and they strongly affect the adjacent base sequences so that each gene is expressed at the appropriate stage and in the appropriate organ, according to the generation-rule of each genome: the reverse-complement symmetry, the bias and the multiple fractality of the base-distribution. That is, most of the regulatory elements (base sequences) for gene expression, cellular complexity and the inheritance of living cells are placed in the non-coding regions of the genome. As mentioned above, to generate an active genome, the genomic base sequence should have at least these three characters together. Recently, many non-coding small RNAs have been found in genomes and shown to participate in the regulation of gene expression (Lai et al., 2005; Martens et al., 2005; Taft et al., 2007; Molnar et al., 2010). In addition, the proportion of these non-coding RNAs in the genome was related to the regulation of individual gene expression and was shown to affect the complexity of living cells (Mattick, 2004; Lynch, 2007). Many other regulatory sequences of individual genes, such as enhancers, promoters, poly(A)-binding signals, MARs (SARs), insulators and introns, are present in the genome (Webb et al., 1992, 2002; Levine & Tjian, 2003). These regulatory elements (base sequences) should be compact or adaptable. In prokaryotic genomes, such regulatory elements were either not evident or rarely observed, and regulation apparently differed from that of eukaryotic cells in transcription, in other genomic and molecular events, in the gene organization of the genome, and in the expression of each gene (Kozak, 1983; Niehrs & Pollet, 1999).
Considering the various regulatory sequences of individual genes on a chromosome, it is very suggestive that, in a genome-scale analysis, the human Down Syndrome Critical Region (DSCR), consisting of approximately 1 Mb on chromosome 21, did not always cause the syndrome in mouse (Olson et al., 2004, 2007). In other words, observations at the gene scale were not always in accord with those at the genome scale. Such research suggested that chromosomal (genomic) DNA is a molecule with a huge molecular weight and a higher-order chromosome structure that includes both the coding and the non-coding regions, and that it would be necessary to analyze the entire base sequence of a genome to understand gene expression in living cells (Takeda & Nakahara, 2009, 2013; Takeda, 2009, 2011, 2012; Nakahara & Takeda, 2010a, 2010b).
In this report we examined whether the multiple fractality of whole-genome base sequences, especially the power-law-tail at L-values of 16 bases or more (Takeda & Nakahara, 2009; Takeda, 2009, 2011) in the tail-end region of the base-distribution curve of genomic DNA, could be related to the respective species, and whether it could have promoted the development of chromosomes and created the diversity or the complexity of living cells.

Distribution Curve of Base A, T, C, or G in Genome or Chromosome
In this paper, we used the base-distribution as the frequency F (L) instead of the probability P (L). For the A-base, the x-axis was L and the y-axis was the base frequency F (L) in the double logarithmic plot fitted by equation 1, y = ae^{-bx}. The same calculations were performed for the other bases as previously reported (Takeda & Nakahara, 2009; Takeda, 2009, 2011, 2012; Nakahara & Takeda, 2010a, 2010b). Each base-distribution curve F (L) expressed the distribution of the distance L between a base and the next occurrence of the same base. For the base "A", the L-value corresponded to the number of bases from one "A" to the next "A" in the genomic DNA, and F (L) was the total number of times that base-distance occurred in the genomic DNA.
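The definition of F (L) above can be sketched in a few lines. This is a minimal illustration, assuming that L is counted as the difference in position between consecutive occurrences of the same base, so that two adjacent "A" bases give L = 1. The function name `base_distance_frequencies` is hypothetical and is not taken from the original analysis.

```python
from collections import Counter


def base_distance_frequencies(sequence, base):
    """Return {L: F(L)} for one base on a single strand.

    L is the distance from one occurrence of `base` to the next one,
    and F(L) is how many times that distance occurs in `sequence`.
    """
    sequence = sequence.upper()
    base = base.upper()
    counts = Counter()
    previous = None  # position of the most recent occurrence of `base`
    for index, symbol in enumerate(sequence):
        if symbol == base:
            if previous is not None:
                counts[index - previous] += 1
            previous = index
    return dict(counts)


# Toy sequence (hypothetical, not genomic data): A at positions 0, 1, 4, 7
print(base_distance_frequencies("AATGACCA", "A"))  # {1: 1, 3: 2}
```

Running the same function with "T", "G" and "C" gives the corresponding curves for the other three bases.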

Adenine Base (A)
S. cerevisiae maintained the three structural features mentioned above in a single strand of the genomic DNA of its 16 nuclear chromosomes. All 16 chromosomes were AT-rich, with almost identical GC-contents of approximately 38.0% irrespective of their different chromosomal sizes. S. cerevisiae is a single-celled eukaryote, and its 16 chromosomes are not large, making up a small genome (ca. 12,155,038 nt in total). In the double logarithmic plot of L vs F, for L-values of 1 through 15, the frequency F (L) of the base-distribution of the adenine base (A-base) followed an exponential equation, y = ae^{-bx} (eq. 1; x = log L, y = log F (L); a and b were constants). For the A-base in the S. cerevisiae genome, the a- and b-values calculated from equation 1 (eq. 1) were 1.0 (E + 06) and 0.3446, respectively (Figure 1).
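A fit of equation 1 can be sketched as follows. The fitting procedure used for the reported a- and b-values is not described in this text, so the code assumes an ordinary least-squares line through ln(y) against x, which is one common way to fit y = ae^{-bx}. The function name `fit_exponential` is hypothetical, and the caller decides which transformation of L and F (L) is passed as x and y.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(-b * x) and return (a, b).

    Assumption: least squares on ln(y) versus x; all y must be positive.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_log) for x, v in zip(xs, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from the a and b reported in the text
# (not the measured frequencies), for the short-distance range 1..15.
xs = list(range(1, 16))
ys = [1.0e6 * math.exp(-0.3446 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a), round(b, 4))
```

On noise-free synthetic input the fit simply recovers the parameters that generated it, which is a useful sanity check before applying it to measured frequencies.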
Figure 1-(1). S. cerevisiae genome (12,155,038 nt)
In contrast, when the L-value was 16 or more, the base-distribution F (L) was different, i.e., it gave a straight line, y = Ux + W (equation 2, eq. 2; U was the slope and W was the intercept). Figure 1 showed the distribution of the A-base in several eukaryotic chromosomes. In all chromosomes, when the L-value was extended to the maximum at which the base-frequency appeared in the genome, the distribution curve of the A-base did not follow a single curved line expressing uni-fractality (see the curved lines in the (a)-panels of Figure 1-(1) ~ (6)). Therefore the L-value was partitioned (the L-partition is described in detail below; Tables 1 and 2). Multiple power-law-tail regions were found in the frequencies of each of the four bases in the larger eukaryotic genomes and chromosomes (Figure 1-(5), 1-(6), Tables 1 and 2). They had long power-law-tails, which were partitioned at L-values of 16 bases or more. These long power-law-tail regions were observed in other eukaryotes with larger genomes than S. cerevisiae (Tables 1 and 2, Figure 1). In these genomes, the long power-law-tail regions were partitioned at L = 16 ~ 35 (designated as slope 1), 36 ~ 75 (slope 2), 76 ~ 120 (slope 3) and 121 or more, usually up to the maximum L-value (slope 4). As the L-value became larger, the slope-value became smaller; for instance, slope 1 was usually larger than slope 2. In all genomes, the a-, b- and U-values were identical between A and T, and between G and C, in a single strand of DNA within one chromosome or genome (Tables 1 and 2, Figure 1).
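The partition of the power-law-tail into slopes 1 to 4 can be sketched as a piecewise straight-line fit. This is an illustration under stated assumptions: base-10 logarithms are used for both axes, each range is fitted by ordinary least squares, and a range with fewer than two points is reported as missing. The names `SLOPE_RANGES`, `fit_line` and `tail_slopes` are hypothetical; only the range boundaries are taken from the text.

```python
import math

# L ranges for slopes 1-4 as given in the text; None means "up to max L".
SLOPE_RANGES = [(16, 35), (36, 75), (76, 120), (121, None)]


def fit_line(xs, ys):
    """Least-squares fit of y = U * x + W; returns (U, W)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def tail_slopes(freqs, ranges=SLOPE_RANGES):
    """Fit log10 F(L) = U * log10 L + W separately in each L range.

    `freqs` maps L to F(L). Returns one (U, W) pair per range, or None
    where the range holds fewer than two usable points.
    """
    results = []
    for low, high in ranges:
        points = [
            (math.log10(L), math.log10(F))
            for L, F in sorted(freqs.items())
            if L >= low and (high is None or L <= high) and F > 0
        ]
        if len(points) < 2:
            results.append(None)
            continue
        xs, ys = zip(*points)
        results.append(fit_line(xs, ys))
    return results


# Synthetic pure power law F(L) = 1e7 * L^-2 (hypothetical, not genomic data)
synthetic = {L: 1e7 * L ** -2 for L in range(16, 201)}
for fit in tail_slopes(synthetic):
    print(fit)
```

For a real genome the four fitted U-values would generally differ from one another, which is what the slope 1 to slope 4 designation is meant to capture.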
Figure 2 panels: (1) Slope 1 vs genome size; (2) Slope 2 vs genome size. The U-values (slopes) used were slope 1, slope 2 and slope 3.
Table 1 showed the a-, b- and U-values of the large chromosomes or genomes.
These large eukaryotic chromosomes had a long power-law-tail region, in which not less than two linearly decreasing regions could be observed, although the frequency F (L) of each base-distribution of A, T, G or C followed an exponential equation, y = ae^{-bx} (eq. 1; x = log L, y = log F (L); a and b were constants), at L-values of 1 ~ 15. In other words, in the large chromosomes, the power-law-tail region should be partitioned into two or more parts, as described above (Tables 1 and 2, Figure 1).
Among the larger eukaryotic chromosomes, the power-law-tail region in plant genomes such as A. thaliana and O. sativa might consist of two or three phases, although the plant genomes differed in size (nt) from the animal genomes (Tables 1 and 2, Figure 1).
Most eukaryotic cells had larger chromosomes or genomes than S. cerevisiae, and the distribution of each of the four bases was shown in Tables 1 and 2. As shown in Table 2, the a-, b- and U-values varied in both the eukaryotic and the prokaryotic cells. Interestingly, these values were almost identical between complementary bases, i.e., between A and T and between G and C, even in a single strand of the genomic DNA of an individual cell. In addition, the b-value was proportionally related to the GC-content (%) of the genome or the chromosome, whereas the a-value was related to the chromosome size (nt) of the individual cells (Figure 2). These features of the a- and b-values were observed in both the eukaryotic and the prokaryotic cells (Table 2). In all cases, the a-, b- and U-values of A and T, and of G and C, were almost identical in the genomic DNA (Table 2).
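Two quantities used in this comparison are easy to compute directly. The sketch below calculates the GC-content (%) of a sequence and checks whether two fitted values, for example the b-values of the A- and T-bases, agree within a tolerance. The function names and the 1% default tolerance are hypothetical choices for illustration; the text itself only states that the values were "almost identical".

```python
import math


def gc_content(sequence):
    """GC-content in percent, counting only A, C, G and T symbols."""
    sequence = sequence.upper()
    gc = sum(1 for s in sequence if s in "GC")
    acgt = sum(1 for s in sequence if s in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


def complement_values_agree(value_1, value_2, rel_tol=0.01):
    """True when two fitted values differ by at most `rel_tol` (relative).

    Assumption: the tolerance is an arbitrary illustrative threshold.
    """
    return math.isclose(value_1, value_2, rel_tol=rel_tol)


# Toy inputs (hypothetical, not taken from Tables 1-3)
print(gc_content("ATGC"))                    # 50.0
print(complement_values_agree(0.345, 0.346))  # True
```

Ambiguous symbols such as N are ignored in the denominator so that unsequenced stretches do not bias the percentage.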
Generally, the genomes or chromosomes of the eukaryotic cells were of two types. One type possessed a long linearly decreasing region (the power-law-tail), and the other possessed a longer linearly decreasing region (the large power-law-tail). The former were those of plants, and the latter were those of primates, mammals and vertebrates. The power-law-tail region could be partitioned into two or more parts according to the base-distance L (Tables 1 and 2). The larger the maximum L-value of a chromosome, the more phases the fractality exhibited.
Table 3 showed the a-, b- and U-values of the archaeal and eubacterial genomes. These values were more varied and diverse in the archaea than in the eubacteria because most archaea have come to live under severe conditions. In addition, the difference between slopes 1 and 2 might be derived from the dispersion of the four bases in the non-coding regions containing the elements necessary to execute various regulations under extreme conditions in the cells. Thus, at the genome level, the archaea might be more similar to the eukaryotes than the eubacteria are (Tables 1 - 3).
*: Maximum L-value at which the A-base frequency was estimated.

Thymine (T), Guanine (G) and Cytosine (C) Bases
The other three bases, thymine (T), guanine (G) and cytosine (C), of the eukaryotes, the eubacteria and the archaea showed patterns similar to that of the A-base. For the T-base, the distribution curve and the a-, b- and U-values were identical with those of the A-base, and likewise the distribution curve and the a-, b- and U-values of the C-base were identical with those of the G-base, as mentioned previously (Tables 1 and 2). Table 2 showed the distribution of the four bases in both the prokaryotic and the eukaryotic cells whose genomes have been reported (NCBI genome database, 2013; The Sanger Institute, 2013; KEGG Organisms, 2013; Saccharomyces Genome Database, 2013). Thus, the structural features (generation-rule) of the base contents of genomic DNA should hold in every organism, with values corresponding to that organism, as shown in the previous papers (Takeda & Nakahara, 2009; Takeda, 2009, 2011).

Biological Meaning of a-, b-, and U-Values
Both the a- and b-values of the chromosomes were almost the same within an organism, although in H. sapiens the a- and b-values differed slightly among the 24 chromosomes because these values might be related to the GC-content (%) of each chromosome (Table 1).
The U-value was the slope at long distances of the base-distribution in a chromosome, and was also related to the base contents of each chromosome. But this power-law-tail region of the base(s) was unique, and no identical region has yet been found in any other chromosome. In other words, the region was the power-law-tail of the distribution of the four bases in the genome, and was essential to the individual chromosomes of the cells (Tables 1 and 2, Figure 1).
In the base-distribution of a genome, the linearly decreasing region of L (the power-law-tail) extended to long distances in the double logarithmic plot of L (the long-foot region of the distance from a base to the next base) vs F (L) in all organisms. The power-law-tail was more readily observed in the genomes of eukaryotic cells. The other three bases, T, G and C, were distributed with similar multiple fractality (equation 1 and equation 2, the power-law-tail) in each chromosome and in the mtDNA of S. cerevisiae (data not shown).
In every genome, the a-, b- and U-values of the A-base equaled those of the T-base, and those of the G-base equaled those of the C-base, in equation 1 and equation 2 (Tables 1 and 2). The b-value was also correlated with the GC-content of each chromosome (Figure 2). In AT-rich genomes such as those of fungi, and in large genomes such as those of plants, slope 3 was observed only slightly for the A- and T-bases but clearly for the C- and G-bases, although slope 3 could escape detection because the region consisted of a low number of bases. Thus, even the smaller genomes and the plant genomes had two or three linearly decreasing regions (power-law-tails) (Tables 1 and 2, Figure 1).
In addition, the a-, b- and U-values might be maintained in parallel with the genome size (nt) and the GC-content (%) within one chromosome. H. sapiens chromosome 1, composed of 247,249,719 nt, was registered in NCBI as 39 divisions as of July 30, 2009, because of its large size. Out of the 39 divisions, we selected three contigs of H. sapiens chromosome 1 for this analysis: NT_004610.18 (12,702,424 nt), NT_032977.8 (73,835,825 nt) and NT_004487.18 (56,413,061 nt).
The a-, b- and U-values were likewise equal among the contigs of the other H. sapiens chromosomes. Slope 2 was generally observed not only in the large eukaryotic genomes such as those of mouse, rat, dog, chicken, cow, fish, insects and plants, but also in small genomes such as fungal genomes, although slope 2 was not easy to distinguish in the latter because of their small base numbers (Table 1). Furthermore, the two slopes in the power-law-tail were specific to the individual eukaryotic genome; therefore, the power-law-tails were useful for classifying cells based on the distribution of the four bases in their genomes.

Power-Law-Tail in the Large Genome
In large genomes such as those of mouse and human, different power-law-tails could be observed in the distribution of the four bases in the genome base sequences. Thus, the large linearly decreasing region of L (the power-law-tail) could be composed of two or more different slopes (Tables 1 and 2, Figure 2). The first two regions overlapped at around L = 35, and slope 1 was usually larger than slope 2. Slopes 1, 2 and 3 seemed to correspond to variable base sequences; slopes 1, 2, 3 and 4 therefore each reflected a different base-distribution. Each slope region might carry out a different regulatory phenomenon in the cells. The slope-values, especially slope 1, might reflect the genome size (nt), or the synergism of the genome size (nt) and the GC-content (%) (Table 1, Figure 2). In AT-rich genomic DNA such as the A. thaliana chromosomes, slopes 3 and 4 were not observed in the frequencies of the A- and T-bases in the power-law-tail, regardless of the genome size (nt) (Table 1).
In every case, slopes 1 ~ 4 existed in the non-coding region, and the four bases in these regions were placed so as to form the power-law-tails and the base-complement even in a single strand of DNA (Tables 1 and 2, Figures 1 and 2).

Discussion
In large genomes, the power-law-tail could be partitioned into two or more regions with different slopes at base-distances of around 35, 75 and 120 in the double logarithmic plot of L (the distance of a base to the next base) vs F (L), and these might be related to the evolution of genomes or the complexity of the cells. In larger genomes such as that of H. sapiens, the non-coding regions account for over 97% of the genome (Raphael et al., 2008; Loots & Ovcharenko, 2010), and the non-coding regions should be essential to express the gene(s) precisely, rapidly and steadily.
The chromosomes of fungi displayed the power-law-tail clearly, whereas in most eubacterial genomes the existence of the power-law-tail region was obscure. The archaeal genomes possessed longer power-law-tail regions than those of the eubacteria and were close to the eukaryotic genomes, although the archaea are prokaryotic cells. The eukaryotic genomes were essentially composed of a great number of bases and revealed a long power-law-tail region. Among eukaryotic cells, in the genomes of higher plants such as A. thaliana and O. sativa, the power-law-tails were not so large, lying within 80 bases, and showed two different slopes (e.g., slopes 1 and 2) with a linearity boundary at around 35 bases (Tables 1 and 2). By contrast, not only in mammals such as H. sapiens and M. musculus but also in other eukaryotic genomes, those of animals, fishes, insects, worms or protozoa, the long power-law-tail region extended over more than 100 bases, and two or more linear phases with boundaries at around 75 and 120 bases might be observed even within each chromosome (Tables 1 and 2).
As shown in Table 2, the a-value of P. aerophilum (archaea) ran contrary to what its genome size (nt) would suggest, because the a-, b- or U-values varied with the base-location in this region irrespective of the smaller genome size. These results suggested that, at the genome level, the archaea were more similar to the eukaryotes than to the eubacteria.
Figure 1-(1) showed the A-base distribution in the S. cerevisiae genome (12,155,038 nt) as described in Materials and Methods. Figure 1-(2) ~ (4) showed the A-base distribution in several chromosomes of S. cerevisiae. Figure 1-(2) showed the A-base in chromosome I of S. cerevisiae (230,203 nt). At short distances (Figure 1-(2)-(b), L = 1 ~ 15 bases), the frequencies of the A-base in chromosome I decreased exponentially, whereas at long distances (Figure 1-(2)-(c), L = 16 bases or more), they decreased linearly (the power-law-tail). In addition, the two regions overlapped at around L = 11 ~ 15 (Takeda & Nakahara, 2009; Takeda, 2009). The small chromosome III (315,350 nt, Figure 1-(3)) of S. cerevisiae also showed both the exponentially and the linearly decreasing regions of the A-base, with a boundary between them, as did chromosome I (Figure 1-(2)), in a single strand of the DNA. The frequencies of the other three bases, T, G and C, were basically similar to those of the A-base, with T matching A and C matching G, within one organism.

Figure 2. Relation between the a-, b- and U-values of various species, based on Table 2

Figure 2 showed the relationships among the values in various genomic DNAs based on the data in Table 2. Slope 1 (U-value) was proportional to the genome size (nt) (Figure 2-(1)), and the b-value was proportional to the GC-content (%) of the genomic DNA (Figure 2-(5)). But slope 2 and slope 3 were not proportional to the genome size (nt) as slope 1 was (Figure 2-(2) and Figure 2-(3)). The a-value was also proportional to the genome size (nt), as slope 1 was (Figure 2-(4)). These results might indicate differences in the base-frequencies in the non-coding, power-law-tail regions of the genome. These values might vary with the length and the base sequences of the A-, T-, C- and G-bases in these regions, i.e., the slope 2 region might be a GC-poor, AT-rich non-coding region. The non-coding region contained regulatory elements consisting of specific base sequences and repetitious base sequences.
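One way to put a number on relationships of the kind shown in Figure 2 is a correlation coefficient between a fitted value and the genome size or GC-content. The sketch below computes the Pearson coefficient. The text does not state which statistic, if any, was used, so this choice is an assumption, and the function name `pearson_r` and the placeholder numbers are hypothetical and are not taken from Table 2.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder values only (hypothetical): a perfectly linear relation
sizes = [1.0, 2.0, 3.0, 4.0]
values = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(sizes, values))
```

A coefficient near 1 or -1 would support a linear relation such as the one described for slope 1, while a value near 0 would match the description of slopes 2 and 3.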

Table 1. The a-, b- and slope values of the four nucleotides in eukaryotic genomes. *: Base number.

Table 3. The a-, b- and U (slope)-values of the A-base in archaea and eubacteria genomes