Exploratory Analysis of Biometric Data Concerning Characteristics of Urucum (Bixa orellana L.) in the Northeast of Brazil

Urucum is a plant adapted to the soil and climate conditions of the semi-arid region. This study evaluates the biometry of urucum seeds. Twenty seeds of annatto were collected in an area of native vegetation where the species occurs, in the Mossoró Mountains, municipality of Mossoró, State of Rio Grande do Norte, Northeast Brazil, in July 2017 and taken to the plant breeding laboratory, where the following characteristics were evaluated: (a) morphological characterization of the seed, determining the length and width in millimeters of 200 well-developed seeds with a caliper with a precision of 0.1 mm, and (b) seed weight, expressed in grams. Descriptive and graphical analyses were carried out using the statistical software R. The length and width showed a small range of variation, resulting in excellent and reasonable coefficients of variation, respectively. We found a regular degree of symmetry and a mesokurtic distribution for seed length and width. There was no significant linear correlation between length and width. The features of urucum seeds did not fit the normal probability distribution.


Introduction
The urucum or urucu (Bixa orellana L.), a shrub species native to tropical America, gains importance in the world agricultural market every year and is cultivated throughout the tropics (Mercadante & Pfander, 1998). The fruits are ovoid capsules containing 30 to 40 seeds (Lorenzi, 1998). Urucum produces distinctive non-toxic coloring substances that are also indicated as a hypolipidemic agent, which makes it one of the few pigments allowed by the World Health Organization (Silva et al., 2006).
The species has multiple uses, such as ornamental and medicinal plant and soil restorer. Urucum is a pioneer species native to the Amazon Rainforest. Its seeds are used to produce condiments and tinctures widely used in industry. The fast growth of urucum allows its cultivation together with other species in degraded areas destined for reforestation (Lorenzi, 1998).
Morphological analyses of fruits and seeds help to understand the germination process and to characterize vigor and viability (Mathuem & Lopes, 2007). Biometric analyses are an essential tool to detect genetic variability within and between populations and to define the relationships between this variability and environmental factors, thus contributing to genetic improvement programs (Gusmão et al., 2006).
weight almost twice the seeds of Heritiera parvifolia. Also, seed characteristics influence the patterns of dispersal and seedling establishment (Fenner, 1993), which are used to differentiate pioneer and non-pioneer species in tropical forests (C. C. Baskin & J. M. Baskin, 1998). In most shrubs and trees, there is an inverse relation between seed size and the number of seeds per fruit (Carvalho et al., 1998).
Urucum stands out for its therapeutic potential and as food. However, there is no record of its exploitation in the region of Mossoró, RN, nor of studies of urucum biometry, which shows the need for studies concerning this species.
The objective of this study was to evaluate, using statistical techniques of exploratory data analysis, the biometric variables related to annatto seeds, as support for studies that compare their characteristics under different environments and for genetic and plant breeding studies.

Materials and Methods
The study was carried out in Mossoró, RN, at geographic coordinates 5°11′ S and 37°20′ W and 18 m of altitude, with an annual mean temperature of 27.5 °C and relative humidity of 68.9% (Carmo Filho et al., 1991). According to Köppen's classification, the climate in Mossoró is BSwh', hot and dry. Twenty seeds of annatto were collected in an area of native vegetation where the species occurs, in the Mossoró Mountains, municipality of Mossoró, State of Rio Grande do Norte, Northeast Brazil, in July 2017 and taken to the plant breeding laboratory, where the following characteristics were evaluated: (a) morphological characterization of the seed, determining the length and width in millimeters of 200 well-developed seeds with a caliper with a precision of 0.1 mm, and (b) seed weight, expressed in grams. The descriptive analysis and graphs were produced with the software R version 3.1.1 (2018).

Results and Discussion
To determine and compare quantitative aspects of the distributions of seed length and width, we based the analysis on the specialized biometric literature (Ferreira, 2005; Spiegel & Stephens, 2009; Oliveira et al., 2009; Bussab & Morettin, 2010; Zar, 2010; Claudio & Stein, 2011; Cecon et al., 2012). Thus, we adopted exploratory data analysis using frequency distributions and box plots, as well as the statistical estimators of the variables under study, which are the main descriptive and inferential statistical measures: arithmetic mean, median, total range, variance, standard deviation, standard error of the mean, coefficient of variation, asymmetry coefficient, kurtosis coefficient, quartiles and interquartile deviation. We used the Pearson correlation coefficient and statistical inference, namely hypothesis tests (t-test or Z-test) at the 5% significance level, based on Student's t-distribution and the normal distribution, respectively, and the construction of confidence intervals with 95% and 99% probability (Tables 1 to 5 and Figures 1 to 9).
Regarding quantiles, the first quartile shows that the lowest 25% of the values for seed length, width and weight reach a maximum of 4.15 mm, 2.51 mm and 20.25 mg, respectively. On the other hand, the third quartile showed that the highest 25% of seed lengths, widths and weights are at least 4.60 mm, 3.37 mm and 29.15 mg, respectively (Figueiredo et al., 2007). The resulting interquartile range, which serves to verify the dispersion of the data around the median and thus to identify the presence of outliers, was 0.45 mm for length, 0.86 mm for width and 8.90 mg for seed weight (Table 4).
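The interquartile ranges above follow directly from the reported quartiles. The sketch below reproduces that arithmetic and, as an assumption not stated in the text, flags outliers with Tukey's 1.5 × IQR fences; the names `REPORTED_QUARTILES` and `iqr_and_fences` are illustrative only, and the authors' own analysis was done in R.

```python
# First and third quartiles reported in the text
# (length in mm, width in mm, weight in mg).
REPORTED_QUARTILES = {
    "length_mm": (4.15, 4.60),
    "width_mm": (2.51, 3.37),
    "weight_mg": (20.25, 29.15),
}


def iqr_and_fences(q1, q3, k=1.5):
    """Return (IQR, lower fence, upper fence).

    The k * IQR fence rule is an assumption; the text only says that the
    IQR helps to identify outliers.
    """
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


summary = {
    name: iqr_and_fences(q1, q3)
    for name, (q1, q3) in REPORTED_QUARTILES.items()
}

for name, (iqr, low, high) in summary.items():
    print(f"{name}: IQR = {iqr:.2f}, fences = [{low:.3f}, {high:.3f}]")
```

Under this rule, an observation falling outside the fences would be drawn as an individual point in a quartile-based box plot.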
From a graphical, visual or geometric point of view, the scatter plot (Figure 7) showed high dispersion, which confirms the lack of correlation between length and width indicated by the low correlation coefficient (0.14). The length and width of urucum seeds did not fit the normal or bivariate Gaussian probability density, so the conclusions obtained through statistical inference are not guaranteed by this assumption (Table 4).
Reporting the quantiles is an excellent way to illustrate the dispersion of a distribution. Researchers are most familiar with the type of quantile called the percentile because of its use in standardized tests. When a test score is reported at the 90th percentile, 90% of the scores are smaller than it and 10% are higher. Unlike variance and standard deviation, quantile values do not depend on the arithmetic mean. When distributions are asymmetric or have outliers, which are extreme values that are not characteristic of the distribution, quantile box plots can portray the distribution of the data more accurately than the mean and standard deviation.
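As a small illustration of this robustness, the sketch below compares the mean, standard deviation and quartiles of two hypothetical samples that differ only in their largest value. The data are invented for the example and are not the seed measurements; `statistics.quantiles` with its default method is one of several quartile conventions, so other software may give slightly different cut points.

```python
import statistics

# Hypothetical values, not the seed measurements from this study.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # largest value made extreme


def summarize(values):
    """Return mean, sample standard deviation and the three quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "quartiles": (q1, q2, q3),
    }


clean_summary = summarize(clean)
outlier_summary = summarize(with_outlier)

print(clean_summary)
print(outlier_summary)
```

In this example the single extreme value shifts the mean and inflates the standard deviation, while the three quartiles stay the same.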
We obtained high values in the Z-test for both seed length and seed width, concluding that the mean values of these characteristics were highly significant (Table 4).
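A minimal sketch of how such a Z statistic can be computed is given below. It assumes a null hypothesis of a population mean equal to zero (`mu0 = 0`), which the text does not state explicitly, and it uses hypothetical lengths rather than the study data; `z_statistic` is an illustrative helper, not a function from the authors' analysis.

```python
import math
import statistics

# Hypothetical seed lengths in mm, not the data collected in this study.
lengths_mm = [4.1, 4.3, 4.4, 4.5, 4.7]


def z_statistic(values, mu0=0.0):
    """Z statistic for H0: population mean == mu0, using the sample SD."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu0) / standard_error


z = z_statistic(lengths_mm)
critical = statistics.NormalDist().inv_cdf(0.975)  # two-sided 5% level

print(f"z = {z:.2f}, critical value = {critical:.2f}, "
      f"reject H0: {abs(z) > critical}")
```

Under the assumed null of a zero mean, any strictly positive measurement with a small standard error gives a large Z value, so the result mainly reflects the precision of the mean.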
According to the descriptive and inferential statistical analyses shown in Table 5, there is no statistically significant simple linear Pearson correlation, since the values obtained for r are very low, between length (mm) and width (mm), length (mm) and weight (g), or width (mm) and weight (g). In addition, the confidence intervals constructed with 95% and 99% probability showed that, in repeated sampling, there is high reliability that these results would be confirmed in at least ninety-nine out of one hundred cases. On the other hand, applying Student's parametric t-test to r produced quite high p-values, which do not allow the researchers to reject the null hypothesis that the true population Pearson correlation coefficient ρ between the evaluated variables is zero; that is, the null hypothesis H0: ρ = 0 is not rejected. It is worth noting that the null hypothesis used in this work is always ρ = 0, which guarantees the symmetry of the sampling distribution of the estimate r so that it can be modeled by Student's t-distribution. If one instead assumed that the population coefficient takes a nonzero value, that is, H0: ρ = ρ0 with ρ0 ≠ 0, Fisher's z transformation would have to be applied to guarantee the symmetry of the sampling distribution of r and thus allow the construction of confidence intervals and the application of Student's t-tests (Fonseca & Martins, 2012). These results reinforce the need for further studies repeated in time and space to verify convergent or divergent results, guiding researchers in the evaluation of genetics and plant breeding as well as the yield performance of this species, to make it viable as a commercial crop.
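The correlation analysis described above can be sketched as follows. The paired values are hypothetical, the helper names (`pearson_r`, `t_statistic`, `fisher_interval`) are illustrative, and the interval uses the usual normal approximation for Fisher's z with standard error 1/√(n − 3); none of this is taken from the authors' R code. The t statistic would be compared with a critical value from Student's t-distribution with n − 2 degrees of freedom.

```python
import math
from statistics import NormalDist

# Hypothetical paired measurements, not the data from this study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]


def pearson_r(a, b):
    """Sample Pearson correlation coefficient."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sxy = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    sxx = sum((u - mean_a) ** 2 for u in a)
    syy = sum((v - mean_b) ** 2 for v in b)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def fisher_interval(r, n, level=0.95):
    """Approximate confidence interval for rho via Fisher's z transform."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


r = pearson_r(x, y)
t = t_statistic(r, len(x))
low, high = fisher_interval(r, len(x))

print(f"r = {r:.3f}, t = {t:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

When the interval contains zero, as it does for this small hypothetical sample, the hypothesis H0: ρ = 0 is not rejected at the corresponding level.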
Box plots based on the median and quartiles are widely used in the biological and medical sciences; they show the median and the first and third quartiles. They also show the lowest and highest values through the lower and upper ends of the vertical lines that originate from the first and third quartiles, respectively. According to Figure 2, which uses the median and quartiles, the seed length data (mm) are strongly concentrated around the central value, in this case the median, which indicates high homogeneity of these observations. This facilitates their analysis and their current and subsequent interpretation, including statistical inferences such as the construction of confidence intervals and the application of hypothesis tests, as well as the fitting of regression models for estimation and forecasting purposes.
The box plot based on the mean and standard deviation, similar to the previous graph, shows the mean and standard deviation in the box. It also shows the lowest and highest values through the lower and upper ends of the vertical lines, where the presence of outliers can also be verified (Figure 3). For seed length, only one atypical value occurred, showing high similarity among the grouped observations apart from this unusual observation. Also according to Figure 3, most seed length data clustered around the central value, in this case the mean, which suggests high homogeneity of these observations.
According to Figure 5, the urucum seed width data (mm) are strongly concentrated around the central (median) value, showing high homogeneity. Using the mean and standard deviation (Figure 6), no atypical data were observed, showing a marked similarity among the pooled observations and a strong concentration of the seed width data around the central (mean) value, which also indicates high homogeneity.
Regarding seed weight, using the median and quartiles (Figure 8) and the mean and standard deviation (Figure 9), the data were strongly concentrated around the central value, which indicates strong homogeneity of these observations. We also found no atypical data, showing a relevant similarity among the pooled observations. In general, the results of the descriptive measures of location, variability, asymmetry and kurtosis can serve as a basis for future studies of descriptive analysis and statistical inference, for the comparison of different environments, genetic improvement studies, grouping of experiments in joint analysis, stability analysis of cultivars, and the construction of so-called components of variance (Ferreira, 2005; Figueiredo et al., 2007; Oliveira et al., 2009; Spiegel & Stephens, 2009; Bussab & Morettin, 2010; Casella & Berger, 2010; Zar, 2010; Claudio & Stein, 2011; Cecon et al., 2012; Costa, 2012).
Standard deviation and variance are special cases of what statisticians and physicists call central moments (CM). The r-th central moment is the mean of the deviations of all observations in a data set from the mean of the observations, each raised to the power r. The first central moment (r = 1) is based on the sum of the differences between each observation and the sample (arithmetic) mean, which is always equal to zero. The second central moment (r = 2) is the variance. The third central moment (r = 3), divided by the cube of the standard deviation (s³), is the asymmetry. The asymmetry describes how the sample differs from the shape of a symmetric distribution.
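In symbols, and under the assumption that the moments are computed with divisor n (the sample variance is often computed with n − 1 instead), the definitions in this paragraph can be written as follows. The notation CM_r and g_1 is introduced here for illustration and does not appear in the original text.

```latex
\[
CM_r = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^{r},
\qquad
CM_1 = 0,
\qquad
CM_2 = s^{2},
\qquad
g_1 = \frac{CM_3}{s^{3}}
\]
```

Here Y_i are the n observations, \bar{Y} is their arithmetic mean, s is the standard deviation and g_1 is the asymmetry coefficient.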
A normal distribution has an asymmetry coefficient equal to zero. A distribution whose asymmetry coefficient is greater than zero is skewed to the right, that is, there is a long tail of larger observations to the right of the mean. If the asymmetry coefficient is less than zero, the distribution is skewed to the left, with a long tail of smaller observations to the left of the mean. The kurtosis, or flattening, is based on the fourth central moment (r = 4) and measures the extent to which the probability density is distributed in the tails versus the center of the distribution.
A distribution is classified as heavy-tailed or light-tailed by comparison with a standard normal distribution (mesokurtic). Aggregated or platykurtic (flattened) distributions have a kurtosis coefficient less than zero; compared to the normal distribution, they have more probability mass in the center of the distribution and less in the tails. In contrast, leptokurtic (tapered) distributions have a kurtosis coefficient greater than zero. Leptokurtic distributions have less probability mass in the center and relatively heavy tails (Gotelli & Ellison, 2011).
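The asymmetry and kurtosis coefficients described above can be computed from the central moments as in the sketch below. It assumes moments with divisor n and the convention that 3 is subtracted from the standardized fourth moment so that a normal distribution has kurtosis zero; statistical packages differ in these details, and the sample is hypothetical.

```python
# Hypothetical measurements, not the seed data from this study.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def central_moment(values, r):
    """r-th central moment computed with divisor n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** r for v in values) / len(values)


def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


g1 = skewness(sample)
g2 = excess_kurtosis(sample)

print(f"first central moment = {central_moment(sample, 1):.3f}")
print(f"skewness = {g1:.5f} ({'right' if g1 > 0 else 'left'} tail longer)")
print(f"excess kurtosis = {g2:.5f}")
```

Sample coefficients are rarely exactly zero, so in practice they are judged against their sampling error rather than by sign alone.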
According to Gotelli and Ellison (2011), the law of large numbers proves that, for an infinitely large number of observations, the sample mean $\bar{Y}_n = \left(\sum_{i=1}^{n} Y_i\right)/n$ approximates the population mean $\mu$, where $Y_n = [Y_i]$ is the sample of size $n$ of a random variable $Y$ with expected value $E(Y)$. Similarly, the variance of $\bar{Y}_n$ is $\sigma^2/n$. Since the standard deviation is just the square root of the variance, the standard deviation of $\bar{Y}_n$ is $\sqrt{\sigma^2/n} = \sigma/\sqrt{n}$, which is the same as the standard error of the mean. Therefore, the standard error is an estimate of the standard deviation of the population mean.
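The step from the variance of a single observation to the variance of the sample mean can be made explicit. The derivation below assumes, beyond what the text states, that the observations Y_1, ..., Y_n are independent and share the same variance σ².

```latex
\[
\operatorname{Var}\!\left(\bar{Y}_n\right)
= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(Y_i)
= \frac{n\,\sigma^{2}}{n^{2}}
= \frac{\sigma^{2}}{n},
\qquad
\operatorname{SD}\!\left(\bar{Y}_n\right) = \frac{\sigma}{\sqrt{n}}
\]
```

In practice σ is unknown and is replaced by the sample standard deviation s, giving the usual estimate s/√n.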
If the conclusions based on a single sample are representative of the entire population, using the standard error of the mean is recommended. However, if the samples limit the conclusions, it is better to use the sample standard deviation.
Large observational surveys covering large spatial scales with a substantial number of samples are likely representative of the population of interest as a whole, and the standard error of the mean should be used. Small, controlled experiments with few replicates are likely based on a single cluster and are possibly unrepresentative of individuals. As a consequence, the standard deviation should be used to measure the degree of absolute dispersion of the samples.

Conclusions
Seed length and width had a small range of variation, while seed weight in milligrams had a larger one. The coefficients of variation were optimum, good and regular, respectively, which shows a high, medium and regular degree of homogeneity of the characteristics evaluated, with length being much less dispersed than seed width and weight.
There was a regular degree of symmetry and a mesokurtic distribution for the length, width and weight of the seeds.
Seed length, width and weight did not show significant correlation, and their mean values showed highly significant differences.
Only the width and weight data of the annatto seeds evaluated in this work fitted the normal probability distribution.
The results of the descriptive measures of location, variability, asymmetry, and kurtosis can aid in future studies of descriptive analysis and statistical inference, for the comparison of different environments, studies of plant genetic improvement, and support criteria used for grouping of experiments in analysis, stability studies of cultivars, multivariate analysis, as well as the construction of so-called components of variance.