Parametric versus Semi- and Nonparametric Regression Models

There are three types of regression models researchers need to be familiar with, along with the requirements of each: parametric, semiparametric, and nonparametric regression models. The choice among them depends on how much information is available about the form of the relationship between the response variable and the explanatory variables, and about the distribution of the random errors. In this article, the differences between the models, common estimation methods, robust estimation, and applications are introduced. The R code for all the graphs and analyses presented in this article is available in the Appendix.


Introduction
The aim of this paper is to answer several questions that researchers may have when they fit real data: What are the differences between parametric, semiparametric, and nonparametric models? Which one should be used to model a real data set? Which estimation method should be used? And which modeling approach is better? These questions and others are addressed through examples in this paper, and the R code for the plots and analyses presented is available in the Appendix.
Assume that a researcher collected data about a response variable, y, and k explanatory variables, (x_1, x_2, . . . , x_k). The model that describes the relationship between y and X can be written as

y = f(X, β) + ε,  (1)

where β is a k × 1 vector of parameters, X = (x_1, x_2, . . . , x_k) is an n × k matrix of regressor values, ε is an error term, and f(·, ·) is the function that describes the relationship between y and X.
The choice among parametric, semiparametric, and nonparametric regression models depends on the prior knowledge about the functional form of the relationship, f(·, ·), and the random error distribution. If the form is known and correct, the parametric model can fit the data set well. However, if a wrong functional form is chosen, it will result in larger bias compared to the other competitive regression models (Fan and Yao, 2003). The most common functional form is the linear model, a type of parametric regression, which is frequently used to describe the relationship between a dependent variable and explanatory variables. Parametric linear models require the estimation of a finite number of parameters, β.
Parametric models are easy to work with, estimate, and interpret. So, why are semiparametric and nonparametric regression models important? Over the last decade, increasing attention has been devoted to these regression models as new techniques for estimation and prediction in different areas, such as epidemiology, agriculture, and economics. Nonparametric regression analysis relaxes the assumption of linearity, or even of a priori knowledge about the functional form, and enables one to explore the data more flexibly. However, in high dimensions the variance of the estimates increases rapidly because of the sparseness of the data; this is called the curse of dimensionality. To solve this problem, semiparametric models have been proposed, such as the single index model (SIM). The three modeling approaches have been compared in terms of fitting and prediction (Rajarathinan and Parmar 2011; Dhekale et al. 2017). Mahmoud et al. (2016) showed that if the link function in generalized linear models is misspecified, semiparametric models are better than parametric models. In Section 2, the three types of models are introduced. In Section 3, the question of which model should be used is addressed. In Section 4, common estimation methods for semiparametric and nonparametric models are presented. In Section 5, the multiple-regressor case is presented. A robust estimation method for semi- and nonparametric regression is introduced in Section 6. Section 7 includes the discussion and conclusion.

Parametric, Semi and Nonparametric Regression Models
To differentiate between the three types of regression models, without loss of generality, assume that we have a response variable, y, and two explanatory variables, x_1 and x_2. The regression model that describes the relationship between the response variable and the two explanatory variables can be written as:

y = f_1(x_1, β_1) + f_2(x_2, β_2) + ε,  (2)

where β_1 and β_2 are the model parameters that need to be estimated, and f_1(·, ·) and f_2(·, ·) are the functions that describe the relationships between y and the two explanatory variables, x_1 and x_2.

Parametric Models
Parametric models are models in which the vector of parameters, β in model (1), lies in a finite p-dimensional space (the number of parameters, p, may be less than or greater than the number of explanatory variables, k). Our interest in this case is estimating the vector of parameters. In the parametric case, the researcher assumes the form of the model and its assumptions. Applied to model (2), f_1(·, ·) and f_2(·, ·) need to be known functions, and the error distribution must be known a priori for inference purposes. For example, the researcher may assume the functions are linear and the error term follows the normal distribution. In general, f_1(·, ·) and f_2(·, ·) can be assumed to be linear or nonlinear in terms of the β's. For model validity, after fitting the data using the assumed model, the researcher needs to check whether the assumptions are met by examining the residuals.

Linear Models
Linear models are linear in the parameters (i.e., linear in the β's). For example, polynomial regression, which is used to model curvature in a data set by using higher-order values of the predictors, is a linear model in the β's. Hence, the final regression model is a linear combination of higher-order predictors. There are many examples of linear models, such as
• y = β_0 + β_1 x_1 + β_2 x_2 + ε (multiple linear regression model),
• y = β_0 + β_10 x_1 + β_11 x_1² + β_20 x_2 + β_21 x_2² + ε (polynomial regression model of the second order),
• log(µ) = β_0 + β_1 x_1 + β_2 x_2 (Poisson regression, in case the response variable, y, is count data).
These models are linear in the parameters, and the number of parameters is greater than the number of explanatory variables by one (p = k + 1). For this setting, there are many methods for parameter estimation, such as ordinary least squares and maximum likelihood. The main interest of the researcher, in this case, is estimating the vector of parameters. Once the parameters are estimated, everything afterwards is straightforward, such as inference and prediction.
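As an illustration, the three models above can be fitted in R with lm() and glm(). This is a minimal sketch on simulated data; the variable names and simulated values are assumptions for the example, not part of the paper's data.

```r
set.seed(1)
n   <- 100
x1  <- runif(n); x2 <- runif(n)
y   <- 1 + 2 * x1 - 3 * x2 + rnorm(n)           # linear mean with normal errors
cnt <- rpois(n, lambda = exp(0.5 + x1 + x2))    # count response for the Poisson model

fit_lin  <- lm(y ~ x1 + x2)                      # multiple linear regression
fit_poly <- lm(y ~ x1 + I(x1^2) + x2 + I(x2^2))  # second-order polynomial, still linear in the betas
fit_pois <- glm(cnt ~ x1 + x2, family = poisson) # Poisson regression with log link

summary(fit_lin)$coefficients
```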

Nonlinear Models
Nonlinear regression models are parametric models in which f(·, ·) is known but nonlinear in the parameters. A classic example is the exponential model y = β_0 e^{β_1 x} + ε, which cannot be written as a linear combination of the β's. For nonlinear regression models, the nonlinear least squares method can be used to estimate the model parameters.
There are several algorithms for the nonlinear least squares estimation, such as Newton's method and Gauss-Newton algorithm.
1. Newton's method is based on a gradient approach. It can be computationally challenging and depends on the initial values (Atkinson, 1989).
2. The Gauss-Newton algorithm is a modification of Newton's method that approximates the solution Newton's method would give, but it is not guaranteed to converge (Nocedal and Wright, 1999).
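As an illustration of nonlinear least squares, the exponential model above can be fitted with R's nls(), which uses the Gauss-Newton algorithm by default. The simulated data and starting values here are assumptions for the sketch.

```r
set.seed(2)
x <- runif(80, 0, 2)
y <- 2 * exp(0.8 * x) + rnorm(80, sd = 0.3)  # data from y = b0 * exp(b1 * x) + error

# nls() uses Gauss-Newton by default; starting values matter for convergence
fit_nls <- nls(y ~ b0 * exp(b1 * x), start = list(b0 = 1, b1 = 0.5))
summary(fit_nls)
```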
There are some nonlinear models that can be linearized in the parameters using a transformation. For example, the model y = 1/(β_0 + β_1 x) can be written as 1/y = β_0 + β_1 x. The last equation is linear in the parameters, so it is a linear model. If a researcher finds that the relationship between the response and an explanatory variable is not linear, a simple solution may be used to fix the problem before considering nonlinear, nonparametric, or semiparametric modeling (see the sketch after this list), such as
• applying a transformation to the dependent and/or independent variables, such as a log, square root, or power transformation;
• adding another regressor as a function of one of the explanatory variables. For example, if a researcher considers regressing y on x, it may make sense to regress y on both x and x² (i.e., x squared).
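A minimal sketch of these two fixes in R, with simulated data (the variable names and generating model are assumptions):

```r
set.seed(3)
x <- runif(100, 1, 5)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.2))  # multiplicative relationship

fit_log  <- lm(log(y) ~ x)      # log transformation of the response linearizes the model
fit_quad <- lm(y ~ x + I(x^2))  # adding x^2 as an extra regressor to capture curvature
```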
Practically, parametric models do not work well in many applications because they have very strong assumptions (Rajarathinan and Parmar, 2011).

Nonparametric Models
Parametric and nonparametric regression models differ in that the nonparametric model form is not specified a priori but is instead determined from the data set. The term nonparametric does not mean that such models completely lack parameters, but rather that the number of parameters is flexible and not fixed a priori. In parametric models, the vector of parameters, β, lies in a finite p-dimensional space, and our main interest is estimating that vector of parameters. In contrast, in nonparametric models, the set of parameters is a subset of an infinite-dimensional vector space, and the primary interest is in estimating this infinite-dimensional parameter. In nonparametric regression models, the relationship between the explanatory variables and the response is unknown. Applied to model (2), f_1(·, ·) and f_2(·, ·) are both unknown functions. These functions can take any form, linear or nonlinear, but they are unknown to the researcher. In parametric modeling, the researcher knows the model of the data exactly; in nonparametric modeling, the data determine what regression model should be used, that is, the data decide the form of the functions f_1(·, ·) and f_2(·, ·).

Semiparametric Models
Semiparametric modeling is a hybrid of the parametric and nonparametric approaches to statistical modeling. It may appear at first that the semiparametric model contains the nonparametric model as a special case; however, the semiparametric model is considered "smaller" than a completely nonparametric model because we are often interested only in the finite-dimensional component β. By contrast, in nonparametric models, the primary interest is in estimating the infinite-dimensional parameter. As a result, estimation is statistically harder in nonparametric models than in semiparametric models.
While parametric models are easy to understand and easy to work with, they often fail to give a fair representation of what is happening in the real world. Semiparametric models allow us to have the best of both worlds: a model that is understandable while offering a fair representation of the messiness involved in real life. Semiparametric regression takes several structures. One structure is a form of regression analysis in which some of the predictors do not take predetermined forms while the others have known relationships with the response. For example, in model (2), f_1(·, ·) may be known and f_2(·, ·) unknown. In case the known function is linear, the model can be written as

y = β_1 x_1 + f_2(x_2, β_2) + ε.

Another structure of semiparametric regression modeling, a well-known example, is the single index model, which has been extensively studied and has many applications. In general, it takes the form

y = f(Xβ) + ε,

where f(·) is an unknown function, X = (x_1, . . . , x_k) is an n × k matrix of regressor values, β is a k × 1 vector of parameters, and E(ε|X) = 0. The term Xβ is called a "single index" because it reduces the k-dimensional predictor into one dimension. In this case, the functional form of f(·) is unknown to the researcher. This model is semiparametric since the functional form of the linear index is specified while f(·) is unspecified. In the single index model, the explanatory variables affect the response variable through the index Xβ. The SIM has been extensively studied and applied in many different fields, including biostatistics, medicine, economics, financial econometrics, and epidemiology (Ruppert et al. 2003; Mahmoud et al. 2016, 2019; Toma and Fulga 2018; Li et al. 2017; Qin et al. 2018).
The SIM is more flexible than parametric models and does not suffer from the curse of dimensionality the way nonparametric models do. It assumes that the link between the response and the explanatory variables is unknown and should be estimated nonparametrically. This gives the single index model two main advantages over parametric and nonparametric models: (1) it avoids misspecifying the link function and the misleading results that follow (Horowitz and Hardle, 1996), and (2) it achieves dimension reduction by assuming the link function is a univariate function applied to the projection of the explanatory covariate vector onto some direction. To fix the identifiability problem of the SIM, the coefficient of one of the continuous explanatory variables is set equal to 1 (Ichimura 1993; Sherman 1994) or the constraint ||β|| = 1 is used (Lin and Kulasekera 2007; Xia et al. 2004).
Model (2) can be written in the form of a single index model as

y = f(β_1 x_1 + β_2 x_2) + ε.

Figures 1-4 show different relationships between a response variable and an explanatory variable. In Figures 1-3, the relationship between the response and the explanatory variable is not linear, so a researcher may use a nonparametric model to fit the data, or try to find a polynomial regression model that fits, because there is no known model for these data sets. Figure 4 shows a linear relationship between the two variables, so the linear parametric model can be used to fit the data.

Which Type of Modeling Should You Use to Fit Your Data?
The first step in a statistical analysis is to summarize the response (dependent) variable and the explanatory (independent) variables numerically and graphically. The researcher should look at scatterplots, boxplots, histograms, and numerical summaries to get initial information about the data and to see whether the model assumptions are met and whether there are outliers. Figure 5 shows that the relationship between age and the log of wage is nonlinear. It displays the fitted lines of the linear, quadratic, cubic, and 4th-degree polynomial regressions along with the scatterplot.
Assume that a researcher used the linear model to fit the data. To check the linearity assumption, the researcher needs to look at the residual plots. Figure 6 shows that the linearity assumption is not satisfied, because the fitted-values-versus-residuals plot reveals a nonlinear pattern, and that normality is not satisfied either, as seen in the normal Q-Q plot. Comparing the linear model to the 4th-degree polynomial model, Figure 7 shows a better fitted-versus-residuals graph, and the normality of the residuals is much improved. In addition, we need to look at the p-value associated with the fitted model. For the linear model, the p-value = 0.0008407, which is significant; however, it is clear that the linear model is not suitable for these data, and the R-squared = 0.05357 is very small. For the 4th-degree polynomial regression, the p-value = 1.202e-15 and R-squared = 0.315.
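A minimal sketch of these checks in R; the data frame name d and its columns age and lwage (log of wage) are assumptions based on the description above.

```r
# assumed data frame `d` with columns age and lwage (log of wage)
fit1 <- lm(lwage ~ age, data = d)           # linear fit
fit4 <- lm(lwage ~ poly(age, 4), data = d)  # 4th-degree polynomial fit

par(mfrow = c(1, 2))
plot(fit1, which = 1)  # residuals vs fitted: look for a nonlinear pattern
plot(fit1, which = 2)  # normal Q-Q plot: look for departures from the line

summary(fit1)$r.squared  # compare with summary(fit4)$r.squared
```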

Estimation Methods for Semi/Nonparametric Regression Models
Estimating the unknown function nonparametrically means that the data themselves are used to estimate the function, f(·, ·), that describes the relationship between the explanatory variables and the response. There are two commonly used approaches to estimate the nonparametric regression term:
1. Kernel regression: estimates the conditional expectation of y at a given value x by applying a weighted filter to the data.
2. Spline smoothing: minimizes the sum of squared residuals plus a term that penalizes the roughness of the fit.

Kernel Regression
Kernel regression smoothing is one of the most popular methods for nonparametric regression estimation. It was proposed by Nadaraya (1965) and Watson (1964) and is known as the Nadaraya-Watson estimator (also known as the local constant estimator), though the local polynomial estimator has emerged as a popular alternative. For more details on kernel smoothing, see Wand and Jones (1995). Local polynomial regression combines the simplicity of linear least squares regression with the flexibility of nonparametric regression by fitting simple models to localized subsets of the data to build up, point by point, a function that describes the unknown regression function. At each point in the data range, a p-degree polynomial is fitted to a subset of the data with explanatory variable values near the point whose response is being estimated. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The subset of data used for each point is determined by an input to the procedure called the "bandwidth", which determines how much of the data is used to fit each local polynomial. If the local polynomial used is of the first degree, the fit is locally linear (a straight line is fitted); if a zero-degree polynomial is used, the fit is a weighted moving average, or local constant.
Assume that we have the following model:

y_i = f(x_i) + ε_i,  i = 1, 2, . . . , n.

One can show that the pth-order Taylor series expansion of the nonparametric function f(x_i), assuming that the (p + 1)th derivative of the conditional mean at the point x exists, can be written for x_i near x as

f(x_i) ≈ f(x) + f′(x)(x_i − x) + (f″(x)/2!)(x_i − x)² + . . . + (f^(p)(x)/p!)(x_i − x)^p.

The function can then be estimated by minimizing the quantity

Σ_{i=1}^{n} [y_i − Σ_{j=0}^{p} (f^(j)(x)/j!)(x_i − x)^j]² K_h(x_i − x).

This quantity is a function of the derivatives of the unknown function, which are also unknown. A solution to this problem is to set β_j = f^(j)(x)/j!; in this case, the problem becomes a linear regression problem, and, using the weighted least squares method, the estimate of the nonparametric function minimizes

Σ_{i=1}^{n} [y_i − Σ_{j=0}^{p} β_j (x_i − x)^j]² K_h(x_i − x).  (8)

When p = 0, equation (8) becomes

Σ_{i=1}^{n} (y_i − β_0)² K_h(x_i − x),

and the estimate of the nonparametric function at the value x is a weighted average of y_1, y_2, . . . , y_n. This estimator is called the Nadaraya-Watson, or local constant, estimator and takes the form

f̂(x) = Σ_{i=1}^{n} K_h(x_i − x) y_i / Σ_{i=1}^{n} K_h(x_i − x),

where K is a kernel (weight) function with bandwidth h. The kernel assigns weights to the values around x, and these weights decline as the values get further from the target value x. When p = 1, equation (8) becomes

Σ_{i=1}^{n} [y_i − β_0 − β_1(x_i − x)]² K_h(x_i − x),

and the solution for the regression parameters (β_0, β_1) is called the local linear estimator (for general p, the local polynomial estimator), which is affected by the smoothing parameter h, called the "bandwidth". Loader (1999) explained the cross-validation (CV) and generalized cross-validation (GCV) methods that can be used to select the bandwidth. In case the Gaussian kernel is used, a common rule-of-thumb bandwidth takes the form

h = (4σ̂⁵/(3n))^{1/5} ≈ 1.06 σ̂ n^{−1/5},

where σ̂ is the standard deviation of the sample and n is the sample size.
Some common kernel functions are the uniform, triangular, Epanechnikov, and Gaussian kernels; for example, the Gaussian kernel is K(u) = (1/√(2π)) e^{−u²/2} and the Epanechnikov kernel is K(u) = (3/4)(1 − u²) for |u| ≤ 1. In each case, K(x_i) assigns a weight to x_i (i = 1, 2, . . . , n) based on the distance d between x_i and the target value x, with u = d/h.
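To make the estimator concrete, here is a minimal hand-rolled Nadaraya-Watson smoother with a Gaussian kernel, compared with base R's ksmooth(); the simulated data are an assumption for the sketch.

```r
set.seed(4)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

# Nadaraya-Watson estimate at a single target point x0
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)  # Gaussian kernel weights, declining with distance from x0
  sum(w * y) / sum(w)       # weighted average of the responses
}
grid <- seq(0, 1, length.out = 200)
fhat <- sapply(grid, nw, x = x, y = y, h = 0.05)

plot(x, y)
lines(grid, fhat, col = "red")
# base R alternative: ksmooth(x, y, kernel = "normal", bandwidth = 0.1)
```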
The bandwidth has an impact on the estimation of the nonparametric function. Figure 8(a) illustrates the idea of the bandwidth, and Figure 8(b) shows its effect on the smoothness of the estimated function. The derivative of the smoothed function can be used to see whether the curvature is significant. Figure 9 shows the smoothed estimated function and its derivative, along with 95% confidence intervals, using the optimal value of the bandwidth. The figure shows that the function increases and that at some point its slope becomes constant.
The behavior of the Nadaraya-Watson regression estimator (p = 0) for points in the boundary region is unsatisfactory compared to the local polynomial estimator (p = 1). This phenomenon is referred to as "boundary effects". Kheireddine et al. (2015) showed, under some conditions (E(y²) < ∞, E(x²) < ∞, and f(x) twice continuously differentiable in a neighborhood of x), that the bias of the Nadaraya-Watson estimator is

(h²/2) [∂²f(x)/∂x² + 2 (∂f(x)/∂x)(∂f_y(x)/∂x)/f_y(x)] ∫ u² K(u) du,

while that of the local linear estimator is

(h²/2) [∂²f(x)/∂x²] ∫ u² K(u) du,

where K is an integrable smoothing nonnegative kernel function, f(x) is the nonparametric function, and f_y(x) is the density function of the data.
By comparing these two bias formulas, one can conclude that the bias decreases with the bandwidth h for both the local constant and the local polynomial estimators: the smaller the bandwidth, the smaller the bias. In addition, the local constant estimator is affected by ∂f(x)/∂x, ∂f_y(x)/∂x, and f_y(x), which are not present in the bias formula of the local linear estimator.

Spline Smoothing
A spline is a piecewise polynomial. Polynomials of degree j are pieced together at a sequence of knots (θ_1 < θ_2 < . . . < θ_C) such that the spline and its first j − 1 derivatives are continuous at those knots. In a spline estimate, the unknown function is approximated by a power series of degree p,

f(x) ≈ β_0 + β_1 x + . . . + β_p x^p + Σ_{c=1}^{C} β_{p+c} (x − θ_c)_+^p,

where (x − θ_c)_+ = max(0, x − θ_c). In spline estimation, the vector of parameters, β, is obtained by minimizing the following penalized quantity:

Σ_{i=1}^{n} (y_i − f(x_i))² + λ ∫ (f″(x))² dx,

where λ is a tuning parameter that controls smoothness and f″(·) is the second derivative of the nonparametric function. There are many methods for choosing λ, such as CV and GCV (Wahba 1977). For more details on spline smoothing, see Wang (2011).
How many knots should be used, and where should they be located? A possible solution is to fit a spline with a knot at every data point, so it could fit perfectly, and to estimate the parameters by minimizing the usual sum of squares plus a roughness penalty. The smoothing parameter λ has the same effect on smoothing as h does in the kernel smoothing approach. When λ is very large, the smoothed line approaches the linear regression line; when it is small, the smoothed line is wiggly and fluctuates more. Figure 10(b) shows the effect of the smoothing parameter on the smoothness of the estimated function, and Figure 10(a) shows the smoothed function at the optimal value of λ. Figure 11(a) shows the spline-smoothed estimated function using the optimal λ along with a 95% confidence interval. Spline smoothing of the unknown function and its derivative can be used to see whether the curvature is significant. Figure 11(b) shows the smoothed derivative function along with a 95% confidence interval using the optimal value of λ.
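A minimal sketch of spline smoothing in R using smooth.spline(), whose smoothing parameter is chosen by (generalized) cross-validation; the simulated data and the spar values used to illustrate over- and under-smoothing are assumptions.

```r
set.seed(5)
x <- sort(runif(150))
y <- sin(2 * pi * x) + rnorm(150, sd = 0.3)

fit_gcv    <- smooth.spline(x, y)             # lambda chosen by GCV (the default)
fit_stiff  <- smooth.spline(x, y, spar = 1.2) # large smoothing: close to a straight line
fit_wiggly <- smooth.spline(x, y, spar = 0.2) # small smoothing: wiggly, fluctuating fit

plot(x, y)
lines(predict(fit_gcv, x), col = "red")
```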
Figure 10. The estimate of the unknown function using the optimal λ determined by the GCV criterion (a), and the estimates at different values of the penalty λ (b).

There are many R functions that can be used to estimate semi/nonparametric regression models by the kernel and smoothing spline techniques. Table 1 displays the R packages and functions for estimating the nonparametric function.

Multiple Case
The kernel and spline smoothing approaches can be extended to any number of explanatory variables. Assume that a researcher wants to study prestige as the response variable with education and income as the explanatory variables. The scatterplots show that the relationship between education and prestige is linear, while the relationship between income and prestige is nonlinear (unknown). So the following semiparametric model can be assumed:

y = β_0 + β_1 education + f_2(income, β_2) + ε,

where f_2(·, ·) is an unknown function. Smoothing splines or kernel regression can be used to fit this nonparametric term. Figure 12 shows the two estimated components: the linear function of prestige on education and the nonparametric function of prestige on income. The two relationships can also both be assumed unknown, giving the model

y = f_1(education, β_1) + f_2(income, β_2) + ε,

where f_1(·, ·) and f_2(·, ·) are nonparametric functions. Figure 13 shows the estimated nonparametric functions assuming both relationships are unknown.

Figure 11. Estimate of the unknown function (a), and its derivative with 95% confidence interval (b), using the spm function in R at the optimal smoothing parameter λ determined by the REML criterion; the number of knots used is 10 and λ = 25.65.
Figure 12. Estimated relationship between education, income, and prestige using smoothing splines for the model y = β_0 + β_1 education + f_2(income, β_2) + ε.
Figure 13. Estimated relationship between education, income, and prestige using smoothing splines for the model y = f_1(education, β_1) + f_2(income, β_2) + ε.

As another example of the multiple case, real data from Wooldridge (2019) are considered. This data set has quantitative and categorical variables measured on 526 observations and is available in the "np" R package. The response variable is log(wage), and the explanatory variables are educ (years of education), exper (the number of years of experience), tenure (the number of years of current employment), and female (Female or Male). The scatterplots in Figure 14 show the relationships of the explanatory variables with the response. A researcher can assume the relationships between the response and the quantitative variables are unknown and estimate them nonparametrically by fitting the following model:

log(wage) = f_1(educ) + f_2(exper) + f_3(tenure) + β female + ε,

where the categorical variable (female) is treated as a factor and the forms of the other relationships, f_1(·, ·), f_2(·, ·), and f_3(·, ·), are unknown. The output is displayed below and the smoothed functions are displayed in Figure 15.
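A sketch of how such an additive model could be fitted; mgcv::gam is used here as one possible implementation (an assumption, since the paper's own output may come from a different function), with the wage1 data from the np package.

```r
library(mgcv)                  # one way to fit additive models with smooth terms
data("wage1", package = "np")  # Wooldridge (2019) wage data, 526 observations

# additive model: smooth terms for the quantitative variables, factor for female
fit_add <- gam(lwage ~ s(educ) + s(exper) + s(tenure) + female, data = wage1)
summary(fit_add)
plot(fit_add, pages = 1)       # smoothed functions, one panel per term
```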
The model can also be written as a single index model:

log(wage) = f(β_1 educ + β_2 exper + β_3 tenure + β_4 female) + ε,

where f(·) is an unknown function and β^T = (β_1, . . . , β_4) is the vector of single index coefficients. The estimated parameters, R-squared, and optimal bandwidth are displayed below, and the smoothed fitted function, f̂(·), is shown in Figure 16.
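A sketch of fitting the single index model with the np package, assuming its npindexbw/npindex interface with Ichimura's (1993) method; the factor female is recoded here as a numeric dummy so it can enter the linear index.

```r
library(np)
data("wage1", package = "np")
wage1$female_num <- as.numeric(wage1$female == "Female")  # dummy coding for the index

# bandwidth and index-coefficient estimation for the SIM (Ichimura's method)
bw  <- npindexbw(formula = lwage ~ educ + exper + tenure + female_num,
                 method = "ichimura", data = wage1)
sim <- npindex(bws = bw, gradients = TRUE)
summary(sim)  # reports index coefficients, R-squared, and the optimal bandwidth
```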

Robust Nonparametric Estimation
Outliers may affect the kernel smoothing or spline smoothing estimation. So robust estimation is needed in this case.
To estimate the unknown function at a point x under the kernel smoothing method, a pth-degree polynomial is estimated using the weighted least squares method and the fitted value is calculated: the Nadaraya-Watson estimator can be used as described above, or the local polynomial quantity in equation (8) is minimized to obtain the estimate at the value x.
To downweight the effect of unusual values, the kernel weights need to be adjusted by incorporating the residuals into the weight calculation. In this case, a low weight is assigned to large residuals. The general method for the robust linear model is M-estimation, introduced by Huber (1964). This method can be extended to the nonparametric model case as follows.
Multiplying the kernel weights, which control smoothness by assigning weights proportional to the distance between the target value x and the other values x_i (i = 1, 2, . . . , n), by weights assigned to the values based on their residuals (large weights for small residuals and small weights for large residuals), we obtain robust weights of the following form:

R_i(x) = K^d_{x_i}(x) K^r_{x_i}(x),

where K^d_{x_i}(x) is the kernel weight assigned to x_i based on its distance from x, and K^r_{x_i}(x) is the weight assigned to x_i based on its residual. R_i(x) is expected to downweight the effect of unusual values on the estimation. The weight based on the residuals can be obtained from the derivative of an appropriately chosen smooth convex loss function, such as the Huber or bisquare function. These two weight functions take the forms:

Huber: w(r) = 1 if |r| ≤ c, and w(r) = c/|r| if |r| > c;
Bisquare: w(r) = [1 − (r/c)²]² if |r| ≤ c, and w(r) = 0 if |r| > c,

where c is a tuning constant. How is K^r_{x_i}(x) obtained? It is a function of the rescaled residuals and takes the form

K^r_{x_i}(x) = w(r_i), with r_i = [y_i − f̂(x_i)]/ŝ,

where y_i is the ith observation of the response variable, f̂(x_i) is the estimated function at the value x_i, ŝ is a robust estimate of scale, such as the interquartile range of the residuals, and r_i is the rescaled residual of the value x_i. When x_i is an outlying value, r_i is large, so the weight K^r_{x_i}(x) is small. To evaluate the performance of the robust weights compared to the kernel weights, 100 observations were generated from f(x) = {1 + e^{−20(x−0.5)}}^{−1}, where x is generated from Uniform(0, 1), and two outliers (unusual data points) were added manually at (0.8, 0.6) and (0.75, 0.62). Figure 17 shows the estimated function, with the outliers added to the data, using kernel smoothing, spline smoothing, and the kernel robust weights. It reveals that the kernel and spline smoothers are not robust within the interval that contains the outliers, while the robust estimate is not affected by the added outliers.
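A sketch of this simulation, assuming the setup described above (a small noise term is added here); loess with family = "symmetric" stands in for the robust kernel weights of this section, since it applies M-estimation with Tukey's biweight to downweight outlying residuals.

```r
set.seed(6)
x <- runif(100)
y <- 1 / (1 + exp(-20 * (x - 0.5))) + rnorm(100, sd = 0.05)
x <- c(x, 0.8, 0.75); y <- c(y, 0.6, 0.62)  # add the two outliers manually

fit_kernel <- ksmooth(x, y, kernel = "normal", bandwidth = 0.15)  # not robust
fit_spline <- smooth.spline(x, y)                                 # not robust
fit_robust <- loess(y ~ x, family = "symmetric", span = 0.4)      # robust M-estimation

plot(x, y)
lines(fit_kernel, col = "blue")
lines(predict(fit_spline, sort(x)), col = "darkgreen")
lines(sort(x), predict(fit_robust, newdata = data.frame(x = sort(x))), col = "red")
```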