Semiparametric Marginal Models for Binary Longitudinal Data

In this paper, we propose and explore a semiparametric approach to analyzing longitudinal binary data often observed in clinical studies. We applied second-order GEE approach to analyze longitudinal binary responses based on a partially linear single-index model. We use a local polynomial smoothing technique to estimate the single-index. We study the empirical properties of the proposed estimators using simulations. The empirical results demonstrate that if the true underlying model is partially linear, then our proposed method generally provides unbiased and efficient estimators. The proposed method is also applied to some real data sets obtained from longitudinal studies.


Introduction
The commonly used linear regression paradigm models a dependent variable Y as a linear function of a vectorvalued independent variable x in the form E[Y|x] = x ′ β, for a vector of unknown parameters β.Generalized linear models (GLMs) provide a flexible extension of linear models by assuming that the dependent variable Y is of the form E[Y|x] = φ(x ′ β), for some known inverse link function or transfer function φ, which includes the commonly used logistic regression with φ(z) = 1/(1 + e −z ).
The case when both φ and β are unknown is referred to as single-index models (SIM).These models involve challenging problems of jointly estimating φ and β, where φ may come from a large non-parametric family such as the family of all monotonic functions.
In practice, the relationship between the response and covariates may be very complex and linear terms may not be adequate to feature that relationship.Under these circumstances, it might be preferable to include a linear term x ′ β and a nonlinear term φ(u ′ ξ) in a semiparametric regression, where φ is a smooth but unknown function.These models are a natural combination and generalization of simpler models already in the literature, namely single-index models and partially linear models.Carroll et al. (1997) called it generalized partially linear singleindex models (GPLSIM).Li (1991) noted that if φ is monotone, then ξ takes on the same general meaning as "effect" parameters as would occur in ordinary linear models.Furthermore, it is a widely applied method to include a nonparametric function into the model for covariates u that have large dimension and are of little interest (e.g., confounders).This allows us to make inference on the effects of x while making minimal assumptions on u.Marginal semiparametric models and their various extensions have become increasingly popular (see, for example, Carroll et al., 1997;Fan and Li, 2004;Wang et al., 2005;and Yi et al., 2009;among others).Carroll et al. (1997) propose estimates of the unknown parameters (β, ξ) and unknown function φ using quasi-likelihood and local polynomial regression and obtain their asymptotic distributions.However, they did not consider the joint estimation of the marginal mean parameters and association parameters.
In many applications, simply working on the marginal mean responses could be very restrictive.Estimation of the association parameters may be the central theme of the study.For example, in familial studies of inherited traits and developmental toxicology studies of laboratory animals, subjects in a family or cluster share common genetic traits or are subject to common environmental factors, and hence it is of prime scientific interest to study the association between responses.
In this paper, we propose a method based on Prentice's (1988) second-order GEE approach to analyze longitudinal binary responses under partially linear single-index models.The computing algorithm for solving usual estimating equations based on the Newton-Raphson method cannot be directly employed due to the inclusion of a nonlinear function whose form is unknown.To circumvent this problem, following Carroll et al. (1997), we use a local polynomial smoothing technique to perform estimation of the single-index.We study the performance of the proposed method using simulations.
The paper is organized as follows.Section 2 introduces partially linear single-index models for analyzing binary longitudinal data.Section 3 describes the proposed second-order GEE approach for fitting single-index models.Section 4 discusses asymptotic properties of the proposed semiparametric GEE estimators.Section 5 presents results from a simulation study, which was carried out to investigate the empirical properties of the proposed estimators.Section 6 presents applications of the proposed method using two real data sets obtained from longitudinal studies.Section 7 gives the conclusions of the paper.question.

Single-Index Model for Binary Response
Suppose in a longitudinal study, there are N subjects and each subject is observed at a fixed set of T time-points.Let Y it denote a binary response variable from subject i observed at time t for i = 1, . . ., N and t = 1, . . ., T .For subject i, we have a T ×1 response vector Y i = (Y i1 , . . ., Y iT ) ′ .Let y it and y i denote the realizations of the responses Y it and Y i , respectively.Also, let x it = (x it1 , . . ., x itp ) ′ and u it = (u it1 , . . ., u itq ) ′ denote p × 1 and q × 1 vectors of covariates, respectively, from subject i at time t, which may be time-dependent or fixed across the observation times.Define ′ as the vectors of covariates for subject i. Assume that the marginal distribution of Y it is Bernoulli: with the probability of success, Denote p i = (p i1 , . . ., p iT ) ′ .Assuming that the mean of response variable Y it depends only on the covariate vector for subject i at time t, i.e., (Pepe and Anderson, 1994), we consider modelling the mean response by a partially linear single-index logistic regression model in the form where β and ξ are p × 1 and q × 1 unknown parameter vectors, respectively, and φ is an unknown smooth function.
The restriction ||ξ|| = 1 ensures identifiability of ξ (Carroll et al., 1997).Model (3) generalizes the usual logistic regression in the sense that a nonlinear term, φ(u ′ it ξ), is included in the model.If φ(.) is specified as the identity function, then model (3) becomes an ordinary logistic regression model with a known link function.
The marginal variance of the response variable Y it is specified as a function of the marginal mean as We assume that Y it and represent the correlation between the responses Y it and Y it ′ for the given vector of covariates x i .Denote α = (α 12 , . . ., α 1T , α 23 , . . ., α T −1,T ) ′ as the vector of correlation parameters.

GEEs When φ is Identity
If one naively assumes that the single-index φ is identity, then the regression and association parameters may be estimated by using the ordinary GEEs (Liang and Zeger, 1986).In this case, estimates of the regression parameters (β, ξ) may be obtained by solving the estimating equations where R(α) is a correlation matrix for Y i depending on the vector α of correlation parameters.The above equations can be solved numerically for β and ξ using an iterative method.Liang and Zeger (1986) considered estimating the association parameters α by the method of moments, which uses the Pearson residuals rit The moment estimators of α tt ′ may be obtained as where p and q are the dimensions of β and ξ, respectively.Prentice (1988) considered an extension of the GEE approach to allow joint estimation of the regression parameters (β, ξ) and the association parameters α.Specifically, a GEE estimator of the correlation parameter α may be obtained from a second set of estimating equations by noting that the "sample correlation" has mean ρ itu and variance for t < u < T , i = 1, . . ., N, and t = 1, . . ., T .Let Then following Prentice (1988), the second-order GEE (GEE2) estimators of (β, ξ, α) may be obtained by solving the estimating equations where G i = ∂ρ i /∂α and W i = diag{w i12 , . . ., w i1T , w i23 , . . ., w i,T −1,T }.

Semiparametric GEEs When φ is Unknown
An unknown smooth function φ leads to a partially linear single-index model, as defined in (3).For this, the estimating functions U β,ξ and U α in ( 11) and ( 12) involve the unknown function φ.We propose to use a nonparametric approach to estimating φ locally prior to estimating the regression parameters (β, ξ) and association parameters α.
Assuming that the function φ(c) has a second derivative, we may approximate φ(c) by a locally linear function within the neighborhood of c 0 via the Taylor series expansion, We also denote ) is a t × 2 matrix with the tth row ϕ it (c, ξ), and ∆ i = diag{p (1)  it , t = 1, . . ., T }, where p (1) it is the first-order derivative of the mean function p it evaluated at x ′ it β + φ(u ′ it ξ).Let K(c) be a kernel function (or a symmetric density function) with a compact support and h be a bandwidth.
We describe below an algorithm for simultaneous estimation of the unknown function φ, mean parameters (β, ξ) and correlation parameters α.We adopt Prentice's (1988) second-order GEE approach to conduct simultaneous estimation of (β, ξ) and α.
Step 3. Repeat Steps 1 and 2 until the convergence of ( β, ξ, α).The estimates at convergence are referred to as the semiparametric second-order GEE (SGEE2) estimates.
When implementing the aforementioned algorithm, it is often feasible to obtain a set of initial values by fitting a model with the ordinary GEE approach assuming that the single-index φ is the identity function.Our computational experience suggest that the algorithm is not severely but somewhat sensitive to the initial values, and the above choice of initial values ensures a convergence in estimation in most cases.
We conclude this section with a discussion on the bandwidth selection.As the bandwidth h affects both the bias and variance of an estimator, there is always a trade-off between these two inference criteria.Bias correction requires the choice of a relatively small bandwidth, whereas a smaller variance estimate needs a larger value of the bandwidth.In principle, the bandwidth selection is data driven, and traditional methods such as the crossvalidation approach may be applied to select a proper bandwidth h based on available data.Carroll et al. (1997) noted that a sensible choice of the bandwidth h is generally difficult.Instead, they suggested an ad hoc bandwidth, given by ĥopt × N −2/15 = O(N −1/3 ) , which satisfies Nh 4 → 0 and Nh 2 /(log N) 2 → ∞.

Computational Details for Semiparametric GEEs
To estimate the regression parameters (β, ξ) and association parameters α, in Step 1 of our algorithm, we estimate the single-index parameter φ by solving the estimating equations (13).For this, we can write and where log{ pit , and K(.) is a kernel density function (e.g., standard normal density).We use a Newton-Raphson iterative algorithm for estimating φ.For this, we find where the matrix H is defined by Given a set of current estimates η(s for s = 0, 1, 2, . .., where the second term on the right side is evaluated at the current estimates η(s) = { η0(s) (c), η1(s) (c)} ′ .

Asymptotics for φ
The asymptotic distribution of the single-index estimator φ(c) may be established under the following assumptions (see Carroll et al., 1997, for details): i) The density function of u it has a continuous second derivative on its support.
ii) The density function of u ′ it ξ is positive and uniformly continuous for ξ in a neighborhood of its true value.iii) The second-order derivative φ (2) (c) is continuous on its support.iv) The random vector x it is assumed to have a bounded support with E(x ′ it x it ) > 0. v) K( .) is a symmetric probability density function with bounded support.
From Eq. ( 13) we can write, Denote the right side of ( 26) by , where Then by a Taylor series expansion, we get Thus from ( 26) and ( 27), we get Let the marginal density of u ′ ξ be denoted by f (.).Applying the asymptotic properties of the kernel estimators and the Taylor series expansion, following Carroll et al. (1997) and Yi et al. (2009), we find the asymptotic expansion where Ω it is the first element of the vector ( As a consequence, the asymptotic distribution of the single-index estimator φ(c) is obtained as where d(c) is the first diagonal element of the matrix Ω −1 (c).

Asymptotics for ( β, ξ, α)
Applying the first-order Taylor series approximation, from Eqs.( 14)-( 15) the vector of estimators From this, we have Var with where the variance and covariance terms may be approximated by Moreover, if φ is consistent, then the linear functions L generally have an asymptotic multivariate normal distribution with mean vector zero and covariance matrix Thus the joint asymptotic distribution of is multivariate normal with mean vector zero and covariance matrix and We can show that To find the derivative ∂ Ẑitu /∂ξ j in (42), we can replace ∂ pit /∂β j and ∂ piu /∂β j by ∂ pit /∂ξ j and ∂ piu /∂ξ j , respectively.
Hence as N → ∞, we can write and where

Simulation Study
To study the relative performance of the proposed SGEE2 method as compared to the ordinary GEE2 method, we ran two sets of simulations.In the first set, the estimators were studied for the case when the true model is a linear single-index model.In the second set, we considered the true model as a partially linear single-index model.

Response Model
We generated the data by considering a two-group design configuration with a binary response measured on four occasions.The marginal mean response where x i is a dichotomous covariate indicating the group membership for the ith individual (i = 1, . . ., N) observed over a fixed set of T = 4 time-points.We considered P(X i = 1) = 0.5 throughout the simulations.
The data were generated from two different models.In the first model, covariates u i j 's were generated from the uniform distribution U(0, 1), and parameters were fixed at β = 1 and ξ In the second model, covariates u i 's were chosen as the indicators of time-points, with u jit = 1 if t = j ( j = 1, 2, 3) and u jit = 0 otherwise.The parameters were fixed at β = 1 and ξ Table 1.Comparison of Prentice's (1988) ordinary second-order GEE (GEE2) with the proposed semiparametric approach (SGEE2) when u i 's are continuous.[True parameter values: To assess the performance of our proposed semiparametric (SGEE2) approach, we compare it with the ordinary second-order GEE (GEE2) approach of Prentice (1988) under two situations of data structures.First, we consider a scenario where a standard GEE model well fits the data with a linear single-index model.As a linear function is a special case of a nonlinear function, we expect that the proposed estimators would still be consistent, but there may be a possible efficiency loss incurred.In order to generate the data, φ is specified as the identity function.Second, we consider a scenario when our proposed method (SGEE2) well fits the data with a partially linear single-index model where we chose φ(a) = sin(π(1 − a)).
Throughout the simulations, data were generated under exchangeable correlation structures among the responses and we chose corr(Y it , Y it ′ ) = α = 0.3 for all (t t ′ = 1, . . ., T ).Each simulation run was based on 1000 replications of data sets, with each data set containing N = 100 subjects and T = 4 observations per subject.

Diagnostic Methods
The two methods were compared based on empirical biases and mean squared errors (MSEs) of the estimates.The bias of an estimator θ of θ is estimated by bias where θs is the estimate of θ obtained from the sth simulated data set and S is the simulation size.The mean squared error (MSE) of θ is estimated by

Results
Table 1 presents the empirical biases and mean squared errors (MSEs) of the estimators of the regression parameters (β, ξ 1 , ξ 2 , ξ 3 ) and the correlation parameter α for continuous covariates u i 's under both methods (GEE2 and SGEE2).Table 2 repeats these results for discrete covariates u i 's.
Both Table 1 and 2 show very similar results from the two methods when the true model is single-index linear.
Although the finite sample biases from GEE2 tend to be smaller than those obtained from SGEE2, the biases from SGEE2 are still reasonably small.Also we notice that SGEE2 tends to produce larger MSE as compared to GEE2, but the differences do not seem considerable.In summary, if a nonparametric function is included to the mean response, where the true model is actually characterized by an ordinary single-index linear model, the proposed method would still provide consistent estimates, though some efficiency loss may incur.
On the other hand, if the true underlying model is partially linear but we adopt a standard second-order GEE (in this case, Prentice's (1988) approach) model, then the resulting estimators could be biased.The biases for ξ estimates are apparent in Table 1 when covariates u i j 's are generated from a continuous distribution.We also observe that SGEE2 provides lower MSE for the estimators of the linear coefficient β, but the difference is not profound.However, lower MSE from SGEE2 is more apparent for the estimators of ξ and α, which is not surprising.
In Table 2, the two methods appear to provide similar results when the true underlying model is partially linear.This is perhaps due to the fact that the non-linear covariates are all discrete binary variables.As a consequence, the estimated single-index φ(u ′ ξ) is not a smooth function.Hence our proposed method works simply as an ordinary GEE method for this type of data set.

Analysis of ICHS Data
Alfred Sommer and colleagues conducted a study (which we will refer to as the Indonesian Children's Health Study or ICHS) in the Aceh province of Indonesia to determine the causes and effects of vitamin A deficiency in preschool children (Sommer, 1982).We present an analysis of infectious disease data on 250 Indonesian children, a subset of the cohort studied by Somer, Katz, and Tarwotjo (1983).The preschool children were examined up to six consecutive quarters for the presence of respiratory infection.There were 1,200 observations in total.We consider complete data for the first four visits from 548 pre-school children with or without the respiratory infection.
We focus on the question of whether vitamin A deficient children are at increased risk of respiratory infection, which is one of the leading causes of morbidity and mortality in children from the developing world.Such a relationship is plausible because vitamin A is required for the integrity of epithelial cells, the first line of defence against infection in the respiratory tract.Here the goal is to draw inferences on the change in respiratory infection status in the presence of intraperson correlation based on our proposed method.The model parameters were estimated and compared using the second order GEE (GEE2) approach of Prentice (1988) and our proposed semiparametric approach (SGEE2).
We set the binary response variable Y it = 1 if the ith child suffers from reparatory infection at the tth visit, and 0 otherwise, for i = 1, . . ., 137 and t = 1, . . ., 4. The covariates of interest include "Xerop", which represents presence/absence (1 or 0) of xerophthalmia, an ocular manifestation of chronic vitamin A deficiency; "Time" represents time t; "Age" represents the baseline age in months (centered at 36); height for age is represented by "Height", as a percent of the National Center for Health Statistics (NCHS) standard (centered at 90%), which indicates long-term nutritional status.

Analysis of Smoking Data
We also present an analysis of data on cigarette smoking trends from the Coronary Artery Development in Young Adults (CARDIA) study, an epidemiological study that recorded cardiovascular risk factors on five occasions over a 10-year period in black and white males and females (Hughes et al., 1987).This study was conducted in four urban centres (Birmingham, AL; Chicago, IL, Minneapolis, MN; and Oakland, CA) across the United States in which a total of 5,115 young adults aged 18-30 years were followed prospectively and examined up to five times from 1986 to 1996.Recruitment, restricted to blacks and whites, was carried out to achieve approximate balance in sample size with respect to age, race, gender, and education.Study participants were scheduled for visits at years 0, 2, 5, 7, and 10.We consider complete data for the first four visits from 3693 young adults with self reported smoking status (yes/no).
Here the goal is to draw inferences on the change in smoking prevalence of young adults in the presence of intraperson correlation based on our proposed method.The model parameters were estimated and compared using the second order GEE (GEE2) approach of Prentice (1988) and the proposed semiparametric approach (SGEE2).
Let for i = 1, . . ., 3693 and t = 1, . . ., 4, where Age i = age of individual i in years at baseline time; Time t = year since the baseline measurement = 0,2,5,7; the binary indicators Eduh i = 1 if ith individual's education level is high school or less, and 0 otherwise; Educ i = 1 if education level is up to some college, and 0 otherwise; Racebf i = 1 if the person is a black female, and 0 otherwise; Racewm i = 1 if the person is a white male, and 0 otherwise; and Racewf i = 1 if the person is a white female, and 0 otherwise.Exchangeable association structure is modeled here.We take the standard normal density as the kernel.The data-driven bandwidth h is used as discussed in Section 3.2.Here the estimated curve φ(u ′ ξ) does not show much evidence of a nonlinear trend, though a small curvature is visible.This suggests that the data might be well fitted by an ordinary GEE model.
Table 4 reports the estimates of the model parameters, their standard errors, and corresponding z-values.We observe that the estimates for all the covariates from both methods are generally close to each other.The primary reason for similar parameter estimates for the covariates is that the non-linear covariates that are included in this

Figure 2 .
Figure 2.Estimated nonlinear curves for the smoking data from the CARDIA Study