A Boundary Corrected Non-Parametric Regression Estimator for Finite Population Total

This study explores the estimation of a finite population total. For many years the design-based approach dominated statistical inference in sample surveys. The scenario has since changed with the emergence of other approaches (model-based, model-assisted and randomization-assisted), which have proved to rival the conventional approach. This paper focuses on the model-based approach. Within this framework a nonparametric regression estimator for the finite population total is developed. Previous studies have found the nonparametric technique to be more advantageous than its parametric counterpart in terms of robustness and flexibility. A kernel smoother is used in the construction of the estimator. The boundary problem encountered with the Nadaraya-Watson estimator is addressed by modifying it using the reflection technique. The performance of the proposed estimator is compared, using simulated data, with that of the design-based Horvitz-Thompson estimator, the model-based nonparametric regression estimator proposed by (Dorfman, 1992) and the ratio estimator.


Background and Motivation
The goal of a researcher in survey sampling is to estimate population parameters with precision and accuracy. The precision of an estimate depends on the survey strategy employed. For this reason, strategies that utilize auxiliary information are known in the literature to have the upper hand. In fact, auxiliary information on a finite population is often used to increase the precision of estimators of parameters (Cochran, 1977). Previous studies show that in the presence of an auxiliary variable, both model-based and model-assisted approaches perform better than the purely design-based approach, provided the assumed model that links the study variable and the auxiliary variables is appropriate (Prasad & Subhash, 2011). This motivated us to carry out our study within a model-based framework. Further, one has the option of estimating the finite population total using a parametric or a nonparametric regression technique. Regression models give a general relationship between the response variable and the auxiliary variable. A linear regression estimate may, however, produce a large error for every sample size if the true underlying function is not linear and cannot be well approximated by a linear function (László, A, Kohler, & Walk, 2002). To address this problem, nonparametric regression estimation is the option to go for. The advantages stipulated by (Härdle, 1994) are fourfold: it provides a versatile method of exploring the general relationship between two variables; it enables one to predict observations without reference to a fixed parametric model; it is a tool for finding spurious observations by studying the influence of isolated points; and it is a flexible method for interpolating between adjacent values of the auxiliary variable. Notably, nonparametric estimators that use kernels suffer from boundary bias.

Statement of the Problem
Nonparametric regression estimation normally uses a kernel smoothing technique. The Nadaraya-Watson estimator is the commonest of such smoothers. It is, however, known in the literature that this technique induces a substantial amount of bias in the estimate at the boundary. The focus of this paper is therefore to estimate the finite population total, T, under the model-based framework using a technique that does not suffer significantly from the boundary problem. This is developed in the next sections.

Introduction
Suppose we have a finite population of N distinct and identifiable units. Let each population unit have the characteristic or variable of interest Y. It is assumed that there exists an auxiliary variable, X, closely associated with Y, which is known for the entire population.
Often researchers are faced with the problem of estimating a population parameter, for instance the population total, $T = \sum_{i=1}^{N} Y_i$, or the population mean, $\bar{Y}$, among others. Studies in this direction may be found in (Chambers, Dorfman, & Wehrly, 1993) and (Dorfman, 1992). We will thus take a sample, S, so that we have $(X_i, Y_i)$, $i = 1, 2, \ldots, n$. It will be assumed that the $X_i$'s are known for all elements of the population of interest and may be used in the design stage, the estimation stage or both (Hedayat & Sinha, 1991). Below we review the sampling strategies that can be used.

Review of Estimation Approaches in Survey Sampling
The usual inference problem in sample surveys is to estimate some summary characteristic of the population, such as the mean or the total of the Y-values, after observing the sample only. Statisticians with varying points of view have proposed different approaches to making the appropriate inference. These are the design-based (randomization-based) approach, the model-based (prediction-based or super-population) approach, the model-assisted approach and the randomization-assisted model-based approach.

Design-Based (Randomization-Based) Approach
In this approach the values of the variable of interest in the target population are viewed as fixed quantities (constants). This implies that the selection probabilities introduced by the design are used to determine the properties of estimators, such as expected values, variances and biases. It is also known as the classical approach.
Statisticians who rely on design-based methods value them for their ability to eliminate personal bias in selecting the sample and for their usefulness in situations where little is known about the population. Most researchers here look for design-unbiased methods of estimation and are less concerned with the nature of the population itself. This approach describes the way the sample is selected, and the distribution is therefore known exactly because the designer imposes it on the population.
It should, however, be noted that despite the above advantages, obtaining an optimal strategy under this approach might be an impossible task where no restriction on the sample size is made, a result first noted by (Godambe, 1955). Robustness and optimality cannot both be achieved under this approach.

Model-Based Approach
In this approach, unlike the one above, the distribution is a structure innate to the population itself; it is unknown but capable of being modeled. In the model-based approach, or prediction theory inference, the relevant expectations are over all possible realizations of a stochastic model (usually a linear regression model) which connects a variable of interest Y with a set of auxiliary or benchmark variables X (Cox, 1995). Statisticians using this approach view the values of interest in the population as random variables.
One area of sampling in which this super-population approach has received considerable attention is ratio and regression estimation. For example, in spatially distributed geological and ecological populations, the variable of interest for nearby units may be positively correlated, with the strength of the relationship decreasing with distance. When such tendencies are known to exist, they can be used in obtaining efficient predictors of unknown values and in devising efficient sampling procedures. This approach seems especially appropriate in sampling for resources, as in the case of mining, where the cost of sampling is high yet there is a strong economic incentive to obtain the most precise possible estimates for a given amount of sampling effort. When errors are modeled they are taken into account, and models provide for bias adjustment and assessment of the uncertainty of the estimates. Different models for sample selection and for estimation can be developed. We note, however, that the choice of a model and its robustness to misspecification is the major issue. Small deviations from a chosen model may lead to serious errors in inference.

Model-Assisted Approach
The two approaches reviewed above have their own strengths and weaknesses. Though for a long time they were viewed as rival approaches, a considerable number of researchers have attempted to view them as complementary rather than competing. References include (Brewer, 2002) and (Särndal, Swensson, & Wretman, 1992). The model-assisted approach still depends exclusively on randomization-based inference and estimators, but optimizes them under the explicit assumption that the finite population under study is itself a sample drawn from a super-population generated by a specific stochastic model. Basically, inferences are design-based while the model serves as a vehicle to help choose between the randomization-based methods. Because of this, the approach may also be referred to as the model-assisted design-based approach.

Randomization-Assisted Model-Based Approach
As opposed to the model-assisted approach, this approach employs the design-based method simply to protect against model failure (Kott, 2005). Here inferences remain model-based and therefore the concern is with model-unbiasedness and not design-unbiasedness (Langat, Odhiambo, & Odongo, 2007).
The four approaches highlighted above basically stem from two strategies. The first is the traditional design-based approach, which has its conceptual origin in the paper by (Neyman, 1934) and in sampling theory texts such as (Kish, 1965) and (Cochran, 1977), and in which inferences are based on the probability distribution induced by the sampling design with the population values held constant. The second is the model-based approach, strongly linked to Royall and his students, in which inferences are model dependent. (Royall, 1970) gives a summary of the philosophy behind this approach. It should be noted that the nonparametric nature of the design-based approach can make it an obvious methodology for robust inference; however, there are no relevant optimality criteria that can be checked under this approach (Chambers, 2011). Therefore, if one wants both optimality and robustness, the option is the model-based approach.
To remove the boundary effect in kernel estimation, a number of techniques have been developed in the literature, especially in density estimation. For an overview of these techniques, see for example (Karunamuni & Alberts, 2004). This paper explores the reflection technique for addressing the boundary problem in the regression estimation context.

Outline of the Paper
In Section 2, the Horvitz-Thompson estimator (a design-based estimator) and the ratio estimator are stated. These are used for comparison with the nonparametric regression estimator for the finite population total proposed under the model-based framework. Nadaraya-Watson kernel estimation is reviewed and, in particular, its bias and variance are stated. An overview of the reflection technique as a way of fixing the bias of the estimator is also given in that section, and the finite population total estimator based on it is proposed. In Section 3, the simulation study and its analysis are presented. Discussion of the results and the conclusion are given in Section 4.

Nonparametric Regression Estimation
The interest is to obtain the finite population total:

$$T = \sum_{i=1}^{N} Y_i$$

A famous design-based estimator for the population total is the (Horvitz & Thompson, 1952) estimator given by:

$$\hat{T}_{HT} = \sum_{i=1}^{N} \frac{I_i Y_i}{\pi_i}$$

where $\pi_i$ is the inclusion probability of the $i$th unit, and $I_i = 1$ if the $i$th observation is in the sample and zero otherwise.
Under the model-based framework the sampled units are observed, so estimating the population total is equivalent to estimating the total of the non-sampled units and adding it to the observed sample total, using an equation of the form:

$$T = \sum_{i \in s} Y_i + \sum_{j \in r} Y_j$$

where $s$ denotes the sample and $r$ the set of non-sampled units. The non-sample values can be estimated using a regression model of the form:

$$Y_i = \alpha + \beta X_i + e_i$$

This equation is parametric since the parameters $\alpha$ and $\beta$ have to be estimated using the least squares technique. Parametric models are not flexible, and estimators obtained under such models are therefore not robust. It is known that under a parametric super-population model, misspecification of the model can lead to serious errors in inference, as demonstrated in the empirical study by (Hansen, Madow, & Tepping, 1983). It is for this reason that many researchers use the nonparametric approach. They include (Dorfman, 1992), (Chambers, Dorfman, & Wehrly, 1993), (Odhiambo & Mwalili, 2000), (Tsybakov, 2009), (Chandran & Prajneshu, 2004) and (Breidt & Opsomer, 2009), among others.
Another estimator that can be used under this approach is the ratio estimator. Under simple random sampling (SRS) the estimator of the finite population total may be given by:

$$\hat{T}_{R} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} \sum_{i=1}^{N} X_i$$

where $\sum_{i=1}^{n} y_i$ is the sample total of the study variable and $\sum_{i=1}^{n} x_i$ is the equivalent for the auxiliary variable, which is assumed to be known for the entire population. It is known that the ratio estimator is the Best Linear Unbiased Predictor (BLUP) under a linear model through the origin with error variance proportional to the auxiliary variable (Cochran, 1977), (Cox, 1995) and (Brewer, 2002).
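The two benchmark estimators above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the toy population in which y is exactly proportional to x, and the seed are assumptions made for the example and are not taken from the study.

```python
import random

def horvitz_thompson_total(y_sample, inclusion_probs):
    """Design-based estimate: sum of y_i / pi_i over the sampled units."""
    return sum(y / p for y, p in zip(y_sample, inclusion_probs))

def ratio_total(y_sample, x_sample, x_population_total):
    """Ratio estimate: (sample total of y / sample total of x) * known total of x."""
    return sum(y_sample) / sum(x_sample) * x_population_total

# Hypothetical population (an assumption for illustration): y = 3x exactly.
rng = random.Random(1)
N, n = 200, 50
x_pop = [rng.uniform(1.0, 10.0) for _ in range(N)]
y_pop = [3.0 * x for x in x_pop]

# Simple random sample without replacement, so pi_i = n / N for every unit.
sample_ids = rng.sample(range(N), n)
y_s = [y_pop[i] for i in sample_ids]
x_s = [x_pop[i] for i in sample_ids]

t_ht = horvitz_thompson_total(y_s, [n / N] * n)
t_r = ratio_total(y_s, x_s, sum(x_pop))
```

In this artificial population the ratio estimator recovers the true total up to rounding, because the proportional model holds exactly, whereas the Horvitz-Thompson estimate varies from sample to sample.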
In non-parametric estimation, the data are allowed to determine the behaviour of the model; the only assumption made about the observations is that they are independent and identically distributed (i.i.d.) from an arbitrary continuous distribution. A model-based non-parametric model is of the form:

$$Y_i = m(X_i) + e_i \qquad (2.6)$$

where $Y_i$ is the variable of interest, $X_i$ is the auxiliary variable, and $m$ is an unknown function to be determined using sample data. The idea of non-parametric regression goes back to (Nadaraya, 1964) and (Watson, 1964). Some of the current references include (Härdle, 1990), (Takezawa, 2006), (Gámiz, Kulasekera, Limnios, & Lindqvist, 2011) and (Tsybakov, 2009), among others.

Review of the Nadaraya-Watson Estimator
The idea of non-parametric regression has gained prominence over the last couple of decades. This section gives a brief derivation of the Nadaraya-Watson estimator.
Let K(.) denote a kernel function which is also twice continuously differentiable and which satisfies the usual kernel conditions, namely that it integrates to one and is symmetric about zero. Further, let the smoothing weight be:

$$w_i(x) = \frac{K\left(\frac{x - X_i}{h}\right)}{\sum_{j \in s} K\left(\frac{x - X_j}{h}\right)} \qquad (2.8)$$

A form of the kernel weight defined as in (2.8) was proposed by (Nadaraya, 1964) and (Watson, 1964). Herein, a simple Nadaraya-Watson kernel estimator of m(x) is considered, assuming a model of the form specified in (2.6). In (2.8), $\sum_{s}$ is the summation over all the sampled units and h is the bandwidth, also referred to as the smoothing or tuning parameter. The Nadaraya-Watson estimator of m(x) is therefore given by:

$$\hat{m}(x) = \sum_{i \in s} w_i(x) Y_i$$

The non-parametric estimator for the finite population total is thus given by:

$$\hat{T}_{np} = \sum_{i \in s} Y_i + \sum_{j \in r} \hat{m}(x_j) \qquad (2.10)$$

The estimator given in equation (2.10) was first suggested by (Dorfman, 1992). For the kernel regression estimator, the estimate of m at a point x is obtained as a weighted function of the observations in the h-neighbourhood of x. The weight given to each observation in the neighbourhood depends on the choice of kernel function.
The bias of the estimator is $E[\hat{m}(x)] - m(x)$. Asymptotic expressions for this bias and for the variance (2.13), together with their derivation, can be found in (Hansen, 2009).
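The Nadaraya-Watson smoother and the resulting total estimator (2.10) can be sketched as follows. The Gaussian kernel, the function names and the error raised when all weights vanish are assumptions made for this illustration; the bandwidth h is taken as given rather than selected by cross-validation.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as an assumed choice of kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_estimate(x, xs, ys, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate of m(x): kernel-weighted average of sampled y."""
    weights = [kernel((x - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights are zero at this x; increase h")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

def dorfman_total(xs_sample, ys_sample, xs_nonsample, h):
    """Observed sample total plus NW predictions for the non-sampled units."""
    predicted = sum(nw_estimate(x, xs_sample, ys_sample, h) for x in xs_nonsample)
    return sum(ys_sample) + predicted
```

Dividing by the sum of the weights is what makes the weights in (2.8) add up to one, so a constant response is reproduced exactly whatever the bandwidth.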

The Proposed Estimator of Finite Population Total
An estimator of the finite population total, $\hat{T}_{npr}$, is hereby proposed. It is the sum of two terms. The first term, $\sum_{i=1}^{n} y_i$, is the observed sample total, which under the model-based approach does not need to be estimated. The second term is the non-sample total, which is estimated non-parametrically using the reflection technique. As noted earlier, the Nadaraya-Watson estimator induces a bias at the boundary. This is because at the boundary, the interval [0, h), the symmetric kernel has a reduced amount of data, or no data at all, on part of its window. The data-reflected technique supplies this data by reflection, placing the information on the negative axis and thereby giving the kernel the information it requires on that part of its window. The following simple steps describe how it works. Let {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} be the set of n observations in the sample. Suppose the data are augmented by adding the reflections of all the points in the boundary, to give the set {(X_1, Y_1), (-X_1, Y_1), (X_2, Y_2), (-X_2, Y_2), ..., (X_n, Y_n), (-X_n, Y_n)}. If a kernel estimate m*(x) is constructed from this data set of size 2n, then an estimate based on the original data can be obtained from it for $x \ge 0$, and set to zero otherwise. This gives a modified general weight function in which each sampled point contributes both through $X_i$ and through its reflection $-X_i$. It can be shown that the estimate will always have zero derivative at the boundary, provided the kernel is symmetric and differentiable. It is also shown, in the section on properties of the data-reflected technique, that the estimate is a p.d.f. for a symmetric kernel. In practice it will not usually be necessary to reflect the whole data set, since if $X_i/h$ is sufficiently large, the reflected point $-X_i/h$ will not be felt in the calculation of m*(x) for x > 0, and hence reflection of points near 0 is all that is needed. (Silverman, 1986), in his example, states that if K is the Gaussian kernel there is no practical need to reflect points with $X_i > 4h$.
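The reflection steps described above can be sketched as follows. The form of the reflected weight used here, K((x - X_i)/h) + K((x + X_i)/h) normalised over the sample, is an assumption consistent with smoothing the augmented data set; the Gaussian kernel and the function names are likewise illustrative choices rather than the authors' implementation.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, an assumed symmetric and differentiable kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def reflected_nw_estimate(x, xs, ys, h, kernel=gaussian_kernel):
    """Kernel regression on the data augmented with the reflections (-X_i, Y_i)."""
    if x < 0:
        return 0.0  # the estimate is set to zero outside the support [0, inf)
    # Each sampled point contributes through X_i and through its mirror image -X_i.
    weights = [kernel((x - xi) / h) + kernel((x + xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights are zero at this x; increase h")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

def reflected_total(xs_sample, ys_sample, xs_nonsample, h):
    """Observed sample total plus reflection-based predictions for non-sampled units."""
    predicted = sum(
        reflected_nw_estimate(x, xs_sample, ys_sample, h) for x in xs_nonsample
    )
    return sum(ys_sample) + predicted
```

Because the reflected weights are even functions of x, the fitted curve is flat at x = 0, which is the zero-derivative property noted above. For points far from the boundary the reflected term is negligible and the estimate is practically the ordinary Nadaraya-Watson fit.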
The next section reviews some properties that are unique to this modified kernel density estimator.

The Kernel Estimator at the Boundary
The interest in this study is in the boundary problem, which occurs in the interval [0, h). It results from a lack of information caused by truncation at this interval, i.e. the density function is continuous on [0, ∞) and is 0 for x < 0. This reduced amount of information leads to serious bias in the estimation, and the estimate becomes inaccurate. The boundary problem arises when the value of x is smaller than the chosen bandwidth. Consider the standard kernel estimator at a point x = ch with $0 \le c < 1$. For a kernel function which has the support [-1, 1], the variable z is then confined to the part of this support that corresponds to non-negative data values. For density estimation, the expectation of the estimator is obtained from a Taylor expansion, (2.17). For the case of regression estimation considered in this study, it can be deduced that (2.17) results in an analogous expression, (2.18). This estimator will only be asymptotically unbiased and consistent if the kernel mass over the truncated range equals 1, which requires $x \ge h$. The implication is that at the boundary point x = 0 the expected value can only reach half the original value for a symmetric kernel.
That is, within the boundary region the kernel mass is less than one, an indication that the density does not live up to the condition of being a p.d.f. on its support at the boundary. One way of correcting this boundary problem is the data-reflected technique. Owing to the symmetry of the kernel function, the reflection estimator can be written as the sum of the standard kernel term and a reflected term, (2.21).
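The loss of kernel mass at the boundary, and its recovery by reflection, can be checked numerically. The sketch below assumes an Epanechnikov kernel on [-1, 1] and the convention z = (x - y)/h, so that for x = ch and data on [0, ∞) the standard term integrates K over [-1, c] and the reflected term over [c, 1]; both the kernel and the convention are assumptions made for this illustration.

```python
def epanechnikov(z):
    """Epanechnikov kernel with support [-1, 1] (an assumed choice)."""
    return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0

def integrate(f, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

def truncated_mass(c):
    """Kernel mass seen by the standard estimator at x = c*h, for c >= 0."""
    return integrate(epanechnikov, -1.0, min(c, 1.0))

def reflected_mass(c):
    """Kernel mass after adding the contribution of the reflected data, for c >= 0."""
    c = min(c, 1.0)
    return truncated_mass(c) + integrate(epanechnikov, c, 1.0)
```

Under these assumptions truncated_mass(0.0) is one half, which matches the statement that the expected value only reaches half the original value at x = 0, while reflected_mass(c) is one for every c in the boundary region.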

The Bias of Data-Reflected Estimation Technique in Regression
It can be shown that this reflection estimator, being symmetric around the origin, has zero first derivative at the origin. The implication is that the reflection estimator always satisfies the so-called shoulder condition. This is the condition where a given density, say m, has a shoulder at 0, i.e. $m'(0) = 0$. See for instance (Mack, Quang, & Zhang, 1999). The graphs in Fig. 3.7 illustrate this. At the boundary the decreased amount of data suggests concavity of the density in the vicinity of the origin. As a consequence, these kernels tend to misinterpret the local concavity as an indication of a mode over the strictly positive region (Hirukawa & Sakudo, 2015). The first term on the right-hand side of expression (2.21) is already given above in (2.18). The second term is handled in the same way: a Taylor expansion is taken and, because of the symmetry of the kernel, the result (2.24) is obtained. Putting together the result in (2.24) with that of (2.18) yields the expectation of the reflection estimator, from which the bias of the estimator of the finite population total, $\hat{T}_{npr}$, given in equation (2.16), follows. This shows that within the boundary interval the estimator still has a bias of order h, while in the interior the expectation coincides with that of the standard kernel estimator. Notably, however, if the underlying density, m, has a shoulder at 0, i.e. $m'(0) = 0$, the term of order h drops out, making the bias of order $h^2$.

The Variance of Data-Reflected Kernel Regression Estimation Technique
The variance can be computed similarly. It splits into two terms, the second of which is zero; the first is evaluated using the symmetry of the kernel function. From the derivation of the bias and the variance, it was noted that the estimated function always fulfils the shoulder condition. Unfortunately, this condition is imposed even for functions whose true density does not satisfy it. Furthermore, while the reflection estimator has a low variance, its bias is fairly high where the shoulder condition is not satisfied, though still lower than that of the Nadaraya-Watson estimator. Even so, an attractive feature of this technique is that it is easy to calculate and at the same time very good for densities that fulfil the shoulder condition.
When the variance and the square of the bias are summed, the MSE (or AMSE) error criterion is obtained.
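The decomposition of the MSE into variance plus squared bias can be verified empirically on any collection of estimates. The helper below is a small sketch; its name and the divisor n (rather than n - 1) in the variance are choices made here so that the identity holds exactly.

```python
def mse_decomposition(estimates, truth):
    """Return (bias, variance, mse) of a list of estimates of a known quantity."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - truth
    # Divisor n, so that mse == variance + bias**2 up to rounding.
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return bias, variance, mse
```

For example, the estimates 9, 10, 11 and 14 of a true value of 10 have bias 1, variance 3.5 and MSE 4.5.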

Empirical Study
This section gives an empirical study that facilitates comparison between the two famous approaches in survey sampling: the design-based and the model-based approaches. The simulation study also compares the proposed estimator with the model-based ratio estimator and with that due to (Dorfman, 1992). Six models, given in Table 3, have been used for this purpose. Simulations were performed for both the response variable Y and the corresponding auxiliary variable X for a population of size N = 2000, from which 2000 simple random samples of size n = 500 were drawn and used for estimation. In each case estimates of the population total were obtained using the design-based Horvitz-Thompson estimator, $\hat{T}_{HT}$, and the three model-based estimators, that is, the ratio estimator, $\hat{T}_{R}$, the nonparametric regression estimator due to (Dorfman, 1992), $\hat{T}_{np}$, and the nonparametric regression estimator proposed in this study, $\hat{T}_{npr}$. The average relative bias of each estimator was obtained by averaging, over the 2000 samples, the deviation of $\hat{T}_i$ from $T$ relative to $T$, where $T$ is the actual population total and $\hat{T}_i$ is the estimate of the population total computed from the $i$th sample. The bandwidth was selected using the data-driven cross-validation technique. See Table 1 for these summaries. Except for the ratio estimator in the linear model and the Horvitz-Thompson estimator in the bump model, the proposed estimator had the smallest average relative bias of all the estimators. It can also be seen that the unconditional MSEs are smallest for the proposed estimator, which beats all the other estimators except the ratio estimator under the linear model; this exception can be attributed to the fact that the ratio estimator is the best linear unbiased estimator under that model.
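The structure of such a simulation can be sketched as below. The signed form of the relative deviation, the estimator interface (sample x, sample y, non-sample x) and the function names are assumptions made for illustration; the six models of Table 3 and the kernel-based estimators are not reproduced here.

```python
import random

def expansion_total(x_s, y_s, x_ns):
    """Horvitz-Thompson under SRS without replacement: (N / n) * sample total."""
    n = len(y_s)
    N = n + len(x_ns)
    return N / n * sum(y_s)

def ratio_total(x_s, y_s, x_ns):
    """Ratio estimator using the known population total of x."""
    return sum(y_s) / sum(x_s) * (sum(x_s) + sum(x_ns))

def simulate(x_pop, y_pop, estimator, n, reps, seed=0):
    """Average relative bias and MSE of an estimator over repeated SRS draws."""
    rng = random.Random(seed)
    N = len(x_pop)
    true_total = sum(y_pop)
    rel_devs, sq_errs = [], []
    for _ in range(reps):
        chosen = set(rng.sample(range(N), n))
        x_s = [x_pop[i] for i in range(N) if i in chosen]
        y_s = [y_pop[i] for i in range(N) if i in chosen]
        x_ns = [x_pop[i] for i in range(N) if i not in chosen]
        t_hat = estimator(x_s, y_s, x_ns)
        rel_devs.append((t_hat - true_total) / true_total)
        sq_errs.append((t_hat - true_total) ** 2)
    return sum(rel_devs) / reps, sum(sq_errs) / reps
```

Any estimator with the same three-argument interface, including kernel-based ones, could be passed to simulate in place of the two shown, with N = 2000, n = 500 and reps = 2000 to mirror the design described above.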

Conclusion and Recommendation
From the study, the proposed estimator gave very satisfactory results. The population totals it produced were closer to the actual totals than those of the other techniques considered. This is evidenced by its small relative biases as well as its MSEs. The reflection technique can therefore be taken as a way of correcting the boundary bias in regression estimation.

Figure 3.7. Shoulder condition: except for the 2nd, these densities satisfy the condition.

Table 1. Summary of respective estimators and their relative biases for population totals

MSEs of the respective estimators and models were also computed. The results are reported in Table 2. Table 3 gives the equations of the models simulated.

Table 2. Summary results for the unconditional MSE (obtained from 2000 iterations and sample sizes of n = 500)

Tables 1 and 2 present the average relative biases and the unconditional MSEs of the various estimators studied, i.e. the nonparametric regression estimator, $\hat{T}_{np}$, the nonparametric regression estimator with the modified kernel, $\hat{T}_{npr}$, the Horvitz-Thompson estimator, $\hat{T}_{HT}$, and the ratio estimator, $\hat{T}_{R}$.