A Model to Approximate the Distribution of Rank Order Associations

The relationship between two sets of ranks can be evaluated by several coefficients of rank-order association. To judge the significance of an observed value of one of these statistics we need a reliable procedure for determining the p-value of the test. In several works the Student's t distribution has been suggested as being relevant for the description of the null distribution of many coefficients. In this article, we propose a new model of density function, the generalized Gaussian on a finite range, which can be used to model data exhibiting a symmetrical unimodal density with a bounded domain. Several simulations illustrate the advantages of this technique over conventional methods. This is particularly useful when the number of ranks is larger than the threshold for which the exact null distribution is known, but lower than the threshold for which the asymptotic Gaussian approximation becomes valid.


Introduction
The extent of agreement between two rankings of $n$ items, numbered from 1 to $n$, can be tested by using a non-parametric statistic of rank correlation in place of the Pearson product-moment correlation. The best known statistics of this type are the Spearman, Kendall and Gini coefficients, which will be denoted, respectively, as $r_1$, $r_2$ and $r_3$. The first is the most often used measure in research in which the dependence is assumed monotonic but otherwise arbitrary. In comparison with $r_1$ and $r_2$, Gini's $r_3$ seems to be applied rather rarely at present, although its characteristics are similar, and sometimes superior, to those of the other rank correlations.
To judge the significance of an observed value of one of these statistics, say $r_h$, we need the exact distribution of $r_h$ under the hypothesis of independence or, at least, a reliable procedure for determining the p-value of the test. Significance levels could, for example, be calculated using asymptotic methods. In this regard, the convergence to the Gaussian distribution legitimizes its use in interpolating the p-values of $r_h$, $h = 1, 2, 3$. Nonetheless, Olds (Olds, E. G., 1938) already observed that a distribution with a finite range causes trouble at the tails when a Gaussian fit is attempted, and this is especially relevant to studies in which the tails are of primary interest. Kendall et al. (Kendall, et al., 1939) add that the Gaussian approximation is satisfactory for moderately large values, but for small values it is subject to the disadvantage inherent in any attempt to represent a distribution of finite range by one of infinite range, that is, the fit near the tails is not likely to be very good. Moreover, rank correlation statistics lie in the interval from −1 to 1, and we think it is clearer to test them by using a theoretical curve with a bounded, rather than infinite, domain.
In order to circumvent these difficulties, many researchers have looked for probability densities capable of fitting the distribution of rank correlations appropriately, including the Johnson $S_B$ (Johnson, N-L., 1949) and the Tadikamalla-Johnson $L_B$ (Tadikamalla & Johnson, N-L., 1994). In particular, Pitman (Pitman, E. J. G., 1937) noted that the first four moments of the $r_1$ distribution were similar to the first four moments of the symmetrical beta or Pearson type II distribution. Continuing this idea, Landenna et al. (Landenna, G., 1989) proposed the symmetrical beta for the Gini coefficient $r_3$ and Vittadini (Vittadini, G., 1996) suggested it for the Kendall coefficient $r_2$. One key factor behind the wide diffusion of this model is the close relationship between the symmetrical beta curve and the Student's t density function (Willink, 2009). This allows the use of readily available tables and hence ensures computational convenience and simple checking of results.
Our objective in this paper is to devise a new model for estimating the p-values of some rank association indices when $n$ is larger than the threshold for which the exact null distribution is known but lower than the value of $n$ for which the Gaussian approximation becomes valid. The structure of the paper is as follows. In Section 2, we succinctly discuss the characteristics of $r_1$, $r_2$ and $r_3$. A new density function, the generalized Gaussian on a finite range (GGFR), is introduced in Section 3 and the prediction of p-values is presented in Section 4. We conclude in Section 5.

Indices of Rank Order Association
The degree of monotone association between rankings can be measured by rank indices of association. The coefficients reported in Table 1 are in general use.
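As an illustration of the three coefficients, the following minimal sketch computes them for a permutation of (1, ..., n) compared with the natural order, assuming no ties. The formulas are the standard textbook forms of Spearman's, Kendall's and Gini's indices and are an assumption here, not a transcription of Table 1; the function names are hypothetical.

```python
from fractions import Fraction


def spearman(pi):
    """Spearman's r_1 for a permutation pi of 1..n against the natural order."""
    n = len(pi)
    s = sum((i - p) ** 2 for i, p in enumerate(pi, start=1))
    return 1 - Fraction(6 * s, n ** 3 - n)


def kendall(pi):
    """Kendall's r_2: (concordant - discordant pairs) divided by n(n-1)/2."""
    n = len(pi)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += 1 if pi[j] > pi[i] else -1  # no ties, so the sign is never zero
    return Fraction(2 * s, n * (n - 1))


def gini(pi):
    """Gini's r_3 (cograduation index); k_n = n mod 2 enters the denominator."""
    n = len(pi)
    k_n = n % 2
    s = sum(abs(n + 1 - i - p) - abs(i - p) for i, p in enumerate(pi, start=1))
    return Fraction(2 * s, n ** 2 - k_n)
```

Exact rational arithmetic is used so that the extremes −1 and +1 are reproduced without rounding error for the reverse and identity permutations.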
In Table 1, $\pi$ is a permutation of order $n$ and $\mathrm{sign}(x)$ takes the values $-1$, $0$, $+1$ according to whether $x$ is negative, zero or positive. We remark that the expressions presented in Table 1 [...] are the reverse permutations of $\pi$ and $\eta$. The larger $r_h$ is, ignoring the sign, the stronger the association between the rankings. All three indices can be interpreted as differences between the distance from perfect direct association (1, 2, [...] & Di Bucchianico, A., 2001) compute the null distribution of $r_1$ for $n = 19, \ldots, 22$ using the representation of its probability generating function as a permanent (a signless determinant) with monomial entries. See also Maciak (Maciak, W., 2009).
It is interesting to note that the quantities $S_1$ and $S_3$ appearing in $r_1$ and $r_3$ can be expressed as a sum of parts, which allows the use of combinations of sub-permutations that significantly reduce the amount of computation required to build the exact distribution. See Otten (Otten, A., 1973) for a division of the permutations into two groups. Girone et al. (Girone, G., et al., 2010) went further by breaking up the permutations into four groups and executing a parallel processing scheme that fits Otten's proposal naturally. Research to date has obtained the null distribution of the Spearman coefficient up to $n = 26$ (Gustavson, 2009) and that of the Gini coefficient up to $n = 24$. The same procedures cannot be applied to Kendall's $S_2$, which, however, benefits from a recurrence relationship. See Panneton & Robillard (Panneton & Robillard, 1972).
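To make the two routes to an exact null distribution concrete, the sketch below builds one by brute-force enumeration and one by recurrence. It assumes that the Spearman quantity is the sum of squared rank differences and that Kendall's coefficient is a linear function of the number of inversions. The recurrence shown is the classical one for inversion counts, used here as a stand-in and not necessarily the algorithm of the cited reference; all function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def spearman_null_by_enumeration(n):
    """Exact null distribution of sum (i - pi_i)^2 over all n! permutations."""
    counts = Counter(
        sum((i - p) ** 2 for i, p in enumerate(pi, start=1))
        for pi in permutations(range(1, n + 1))
    )
    total = sum(counts.values())
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


def inversion_counts(n):
    """Number of permutations of n items with k inversions, k = 0..n(n-1)/2.

    Uses I(m, k) = sum_{j=0}^{min(k, m-1)} I(m-1, k-j): inserting the m-th
    item can create between 0 and m-1 new inversions.
    """
    counts = [1]  # a single item has no inversions
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for j in range(m):
                new[k + j] += c
        counts = new
    return counts


def kendall_null(n):
    """Exact null distribution of r_2 = 1 - 4 * inversions / (n(n-1))."""
    counts = inversion_counts(n)
    total = sum(counts)
    return {1 - Fraction(4 * k, n * (n - 1)): Fraction(c, total)
            for k, c in enumerate(counts)}
```

Enumeration costs n! evaluations and is only feasible for small n, whereas the recurrence grows polynomially in n, which is why the Kendall case is the easy one.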
Under the null hypothesis of independent rankings, the distributions of $r_1$, $r_2$ and $r_3$ are symmetrical and have support in $[-1, 1]$. All the odd moments are zero because of the symmetry. Furthermore, and this is essential in our paper, their variance and kurtosis are known as polynomials in $n$, as shown in Table 2.
Table 2. Second and fourth moments of $r_1$, $r_2$ and $r_3$ [...]
In Table 2, $k_n = n \bmod 2$. Both $\mu_2(n)$ and $\mu_4(n)$ are decreasing functions of $n$, with the values for $r_3$ always intermediate between those of $r_1$ and $r_2$ (those of $r_1$ being systematically greater than those of $r_2$). It can also be observed that, because of the presence of $k_n$, the moments of $r_3$ oscillate with the odd-even parity of $n$, that is, of the number of items.
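Closed-form moments of this kind can be checked for small n by brute-force enumeration. The sketch below does so for the Spearman and Kendall coefficients and compares the results with the well-known variances 1/(n−1) and 2(2n+5)/(9n(n−1)); these two formulas are quoted from the general literature as an assumption and are not copied from Table 2. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def spearman(pi):
    n = len(pi)
    s = sum((i - p) ** 2 for i, p in enumerate(pi, 1))
    return 1 - Fraction(6 * s, n ** 3 - n)


def kendall(pi):
    n = len(pi)
    inv = sum(pi[i] > pi[j] for i in range(n) for j in range(i + 1, n))
    return 1 - Fraction(4 * inv, n * (n - 1))


def null_moment(coef, n, order):
    """Exact moment about zero when all n! permutations are equally likely."""
    perms = list(permutations(range(1, n + 1)))
    return sum(coef(pi) ** order for pi in perms) / len(perms)


def excess_kurtosis(coef, n):
    """mu_4 / mu_2^2 - 3, valid because the null mean is zero."""
    mu2 = null_moment(coef, n, 2)
    mu4 = null_moment(coef, n, 4)
    return mu4 / mu2 ** 2 - 3
```

Because the null mean is zero, moments about zero coincide with central moments, so `null_moment(coef, n, 2)` is the variance.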

A New Model of Density Function
A good model should reproduce the characteristics of $r_1$, $r_2$ and $r_3$ generally observed over the whole population of permutations. Specifically, the curves must be unimodal, symmetrical around zero and bounded in the interval $[-1, 1]$; moreover, as the range widens, they must tend towards the Gaussian probability distribution. The usual procedure to determine theoretical probabilities and expected frequencies is to find a curve with all the properties listed above, and then to integrate it over the given intervals to obtain the probabilities.
The main contribution of the present paper is to provide a theoretical explanation for the behavior of several coefficients of rank order association. Suppose that the relative variation of the probability density $f(|r|)$ of the absolute value of the random variable representing the rank order association is inversely proportional to $1 - f(|r|)$.
To estimate the parameters of the GGFR we follow the moment-matching method. In this regard, the second and fourth centered moments of the GGFR density are
[...] (3)
The variance $\mu_2(\lambda)$ increases for a higher $\lambda_1$ or for a lower $\lambda_2$, whereas the excess kurtosis $\gamma_2(\lambda) = \mu_4(\lambda)/\sigma^4(\lambda) - 3$ rises to zero for a decreasing $\lambda_2$ or diminishes to zero for an increasing $\lambda_1$.
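The following sketch shows how such moments can be evaluated numerically. It rests on an assumption, since the defining equation is not reproduced in this section: the GGFR density is taken to be f(r) = λ1 (1 − |r|^λ1)^λ2 / (2 B(1/λ1, λ2 + 1)) on [−1, 1]. Under that assumed form, the even moments are ratios of beta functions, μ_k = B((k+1)/λ1, λ2 + 1) / B(1/λ1, λ2 + 1). Function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function, computed through log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def ggfr_pdf(r, lam1, lam2):
    """Assumed GGFR density on [-1, 1]; zero at and beyond the endpoints."""
    if abs(r) >= 1:
        return 0.0
    norm = lam1 / (2.0 * math.exp(log_beta(1.0 / lam1, lam2 + 1.0)))
    return norm * (1.0 - abs(r) ** lam1) ** lam2


def ggfr_moment(k, lam1, lam2):
    """Even moment mu_k = B((k+1)/lam1, lam2+1) / B(1/lam1, lam2+1)."""
    return math.exp(log_beta((k + 1.0) / lam1, lam2 + 1.0)
                    - log_beta(1.0 / lam1, lam2 + 1.0))


def ggfr_excess_kurtosis(lam1, lam2):
    mu2 = ggfr_moment(2, lam1, lam2)
    return ggfr_moment(4, lam1, lam2) / mu2 ** 2 - 3.0


def simpson(f, a, b, m=2000):
    """Composite Simpson rule with m (even) subintervals, used as a check."""
    h = (b - a) / m
    total = f(a) + f(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0
```

With λ1 = 2 the assumed density reduces to a symmetrical beta curve, and with λ2 = 0 to the uniform density on [−1, 1], whose variance is 1/3.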
Let us consider a loss function in which the lowest two even moments of $r_{h,n}$ (for $n$ ranks) are matched to those of a GGFR density.
In addition, the presence of a beta function depending on the unknown parameters can create difficulties in numerical stability. To increase the chances of obtaining a global solution in reasonable computational time, we executed a controlled random search (CRS) algorithm, discussed, for example, in Conlon (Conlon, 1992) and Brachetti et al. (Brachetti, et al., 1997). See Amerise et al. (Amerise, 2015) for more details on the procedure used.
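Moment matching can be sketched as a two-equation root-finding problem. The code below is not the CRS algorithm: it swaps in a deterministic nested bisection, chosen only because it is short and reproducible. It assumes the beta-ratio form of the GGFR moments, μ_k = B((k+1)/λ1, λ2 + 1) / B(1/λ1, λ2 + 1), restricts λ2 to be non-negative, and searches λ1 in a hypothetical bracket. The `loss` function is an illustrative choice, not the loss function of the paper.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def ggfr_moment(k, lam1, lam2):
    """Even moment mu_k of the assumed GGFR density."""
    return math.exp(log_beta((k + 1.0) / lam1, lam2 + 1.0)
                    - log_beta(1.0 / lam1, lam2 + 1.0))


def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi]; f(lo) and f(hi) must have opposite signs."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root not bracketed")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def lam2_for_variance(lam1, mu2):
    """For fixed lam1 the variance decreases in lam2, so bisection applies."""
    return bisect(lambda b: ggfr_moment(2, lam1, b) - mu2, 0.0, 1e6)


def fit_ggfr(mu2, mu4, lam1_range=(0.5, 8.0)):
    """Match the second and fourth moments; returns (lam1, lam2)."""
    def fourth_moment_gap(lam1):
        lam2 = lam2_for_variance(lam1, mu2)
        return ggfr_moment(4, lam1, lam2) - mu4
    lam1 = bisect(fourth_moment_gap, *lam1_range)
    return lam1, lam2_for_variance(lam1, mu2)


def loss(lam, mu2, mu4):
    """Illustrative loss: squared relative errors of the two matched moments."""
    lam1, lam2 = lam
    return ((ggfr_moment(2, lam1, lam2) / mu2 - 1.0) ** 2
            + (ggfr_moment(4, lam1, lam2) / mu4 - 1.0) ** 2)
```

A call such as `fit_ggfr(mu2, mu4)` with the moments of a given coefficient returns a parameter pair whose loss is close to machine precision whenever the bracket contains a solution; otherwise `bisect` raises an error instead of returning a poor fit.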
In Table 3 we show the estimates of $\lambda$ for a few values of $n$, with $G(\lambda) < 0.1 \times 10^{-16}$ in each experiment. The rows $\sigma^2$ and $\gamma_2$ indicate, respectively, the variance and the excess kurtosis of the coefficients obtained on the basis of Table 2. All three rank correlations have a platykurtic null distribution, that is, one flatter than the Gaussian. This characteristic is most evident for Spearman's $r_1$ whereas, from this point of view, Kendall's $r_2$ is the closest to the Gaussian distribution.
Generally, as n increases, the parameter estimates increase; moreover, the variance of the best fitting density decreases and the associated excess kurtosis remains negative but tends to zero (which could be a symptom of asymptotic Gaussianity).
The trend for Gini's cograduation has two branches, one for even and the other for odd parity of $n$. This alternating behavior is due to the strong effect of $k_n = n \bmod 2$, which appears both in the expressions and in the moments of $r_3$.

Approximations of p-values
To compare the GGFR solution with the asymptotic method (Gaussian density) and the Student's t alternative for small samples, we consider both exact and fitted significance levels $\alpha$ of the test of $H_0$: rankings are independent against $H_1$: rankings are dependent, by using $\rho_1$ (Spearman), $\rho_2$ (Kendall) and $\rho_3$ (Gini). Let $r_1$, $r_2$ and $r_3$ indicate, respectively, the empirical values of the Spearman, Kendall and Gini rank order associations. The statistics involved in the Student's t approximation are [...] with [...] The statistics involved in the standard Gaussian approximation are [...]
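A minimal sketch of how the competing approximations turn an observed coefficient into a two-sided p-value is given below. Several assumptions are made: the Gaussian statistic is the coefficient divided by its null standard deviation; the GGFR density has the form λ1 (1 − |r|^λ1)^λ2 / (2 B(1/λ1, λ2 + 1)); and the Student's t approximation is the classical one for Spearman's coefficient with n − 2 degrees of freedom, which is equivalent to a symmetrical beta density proportional to (1 − r²)^((n−4)/2). The statistics actually used in the paper may differ, and the function names are hypothetical.

```python
import math


def gaussian_p_value(r, variance):
    """Two-sided p-value from z = r / sqrt(variance) under a standard Gaussian."""
    z = abs(r) / math.sqrt(variance)
    return math.erfc(z / math.sqrt(2.0))


def ggfr_p_value(r, lam1, lam2, m=4000):
    """Two-sided tail probability P(|R| >= |r|) under the assumed GGFR density."""
    log_norm = (math.log(lam1) - math.log(2.0)
                - (math.lgamma(1.0 / lam1) + math.lgamma(lam2 + 1.0)
                   - math.lgamma(1.0 / lam1 + lam2 + 1.0)))
    norm = math.exp(log_norm)

    def pdf(x):
        # max() guards against a tiny negative base caused by rounding
        return norm * max(0.0, 1.0 - x ** lam1) ** lam2

    a, b = abs(r), 1.0
    if a >= b:
        return 0.0  # nothing lies beyond the finite range
    h = (b - a) / m  # composite Simpson rule on [|r|, 1]
    total = pdf(a) + pdf(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * pdf(a + i * h)
    return min(1.0, 2.0 * total * h / 3.0)


def student_t_p_value_spearman(r, n):
    """Classical t approximation with n - 2 d.f., via its symmetrical beta form."""
    return ggfr_p_value(r, 2.0, (n - 4) / 2.0)
```

The finite range shows up directly: at |r| = 1 the bounded model returns a p-value of exactly zero, while the Gaussian approximation still assigns positive probability beyond the attainable values.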

Accuracy of Approximations
Iman & Conover (Iman & Conover, 1978) correctly observe that the discreteness of rank correlations often leads to situations where no critical region has exactly the size $\alpha$. Rather, there will be a choice between the next smaller exact size, called the conservative p-value (denoted by $C_{\alpha,h}$), and the next larger exact size, called the liberal p-value (denoted by $L_{\alpha,h}$). Let $\rho_{\alpha,h,C}$ and $\rho_{\alpha,h,L}$ be the quantiles of $\rho_h$, $h = 1, 2, 3$, corresponding to the probability levels $C_{\alpha,h}$ and $L_{\alpha,h}$, respectively. The test of $H_0$ against $H_1$ above is conclusive if both the conservative and the liberal p-values lie on the same side of the prefixed nominal level $\alpha$. If $C_{\alpha,h} < \alpha < L_{\alpha,h}$, then the test is unreliable.
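The conservative and liberal sizes can be read off an exact null distribution, as the following sketch shows for a one-sided upper-tail test based on Kendall's coefficient. The exact distribution is built with the classical inversion-count recurrence, and the conventions "largest attainable size not exceeding α" and "smallest attainable size not below α" are assumptions made for this example. Function names are hypothetical.

```python
from fractions import Fraction


def kendall_upper_tail_sizes(n):
    """Attainable one-sided sizes P(r_2 >= c) from the exact null distribution."""
    counts = [1]
    for m in range(2, n + 1):  # inversion-count recurrence
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for j in range(m):
                new[k + j] += c
        counts = new
    total = sum(counts)
    sizes, cum = [], 0
    for c in counts:  # few inversions correspond to large r_2
        cum += c
        sizes.append(Fraction(cum, total))
    return sizes


def conservative_and_liberal(sizes, alpha):
    """Largest attainable size <= alpha and smallest attainable size >= alpha."""
    below = [s for s in sizes if s <= alpha]
    above = [s for s in sizes if s >= alpha]
    conservative = max(below) if below else Fraction(0)
    liberal = min(above) if above else Fraction(1)
    return conservative, liberal
```

For n = 4 and α = 0.05 the only attainable sizes around the nominal level are 1/24 and 1/6, which illustrates how wide the gap between the two exact sizes can be for small n.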
To investigate the accuracy of the proposed approximations, we examine a set of 500 nominal levels $\alpha = 0.0001, 0.0002, \ldots, 0.0500$. For each $\alpha$ we compute both the actual $C_{\alpha,h}$ and the fitted $C_{\alpha,h,k}$ conservative p-values and repeat the same calculation for the actual $L_{\alpha,h}$ and the fitted $L_{\alpha,h,k}$ liberal p-values. The fitted p-values are based on the GGFR ($k = 1$), Gaussian ($k = 2$) and Student's t ($k = 3$) probability densities. A summary of the results is presented in Table 4. The most notable figures are emphasized in bold font. For reasons of space, attention is focused on $n = 19, \ldots, 24$, which are the largest values of $n$ for which the exact null distribution is known for all three rank correlations. The quantity $\delta_{\alpha,h,k}$ = 0.5 [...]

Discussion and Conclusion
Over recent years, the number of ranks for which the exact null distribution is fully available has increased for many measures of monotone association. For problems involving a number of ranks that is not covered by the existing software, however, it is necessary to resort to the omnipresent Gaussian approximations while awaiting faster and more economical computers. Yet the Gaussian density can be misleading, particularly in the tails, which are often the most important part. In this paper, we have demonstrated the usefulness of the generalized Gaussian density with finite range (GGFR) for fitting the exact null distribution of three statistics that are routinely used for measuring the correlation between two rankings: the Spearman, Kendall and Gini coefficients. All that is required is that the variance and kurtosis be known functions of the number of ranks.
For the Spearman and Gini coefficients, the performance of the GGFR is decidedly superior to that of the Student's t and Gaussian distributions, which are traditionally employed to estimate tail probabilities. The situation regarding the Kendall coefficient is rather different: in this case, the Gaussian model achieves the best results. The improvement over conventional procedures (Gaussian and Student's t densities) does not appear impressive, but it concerns the distribution tails, which are the most interesting from a practical point of view. It must be added that the GGFR achieves the largest improvement in fitting the null distribution of Spearman's $\rho$, which is the best known and probably the most widely used rank correlation coefficient.

Table 1.
Rank association statistics assume absence of ties. Coefficients in Table 1 vary within the range $[-1, 1]$. The extremes are achieved if and only if there is perfect association, negative or positive, for all pairs: $r_h$ [...]

Table 5.
Observed and predicted p-values. The combined evaluation of more than one approximation may throw light on the correct significance of a test. With regard to $r_3$, the p-values proposed by the GGFR are more conservative than those of the Gaussian and Student's t distributions.
[...] $\{0.0001, \ldots, 0.05\}$ gives the average distance between the lower and upper fitted and actual significance levels, and it is used to assess the quality of the approximation. High values of $\delta_{\alpha,h,k}$ indicate that the approximations to the null distribution of rank correlation $h$, based on model $k$, far exceed or fall short of at least one exact threshold at [...]