Consistency of Penalized Convex Regression

Abstract We consider the problem of estimating an unknown convex function $f^* : (0,1)^d \to \mathbb{R}$ from data $(X_1, Y_1), \dots, (X_n, Y_n)$. A simple approach is to find the convex function closest to the data points by minimizing the sum of squared errors over all convex functions. The convex regression estimator, computed this way, suffers from the drawback of having extremely large subgradients near the boundary of its domain. To remedy this situation, the penalized convex regression estimator, which minimizes the sum of squared errors plus the sum of squared norms of the subgradients over all convex functions, was recently proposed. In this paper, we prove that the penalized convex regression estimator and its subgradient converge with probability one to $f^*$ and its subgradient, respectively, as $n \to \infty$, thereby establishing the legitimacy of the penalized convex regression estimator.


Introduction
We consider the problem of estimating an unknown function $f^* : (0,1)^d \to \mathbb{R}$ from noisy observations $(X_1, Y_1), \dots, (X_n, Y_n)$ when one cannot assume any parametric form for $f^*$ and the only available information is that $f^*$ is convex. We assume that the $X_i$'s are independent and identically distributed (iid) $(0,1)^d$-valued random vectors, and $Y_i = f^*(X_i) + \varepsilon_i$ for $1 \le i \le n$, where the $\varepsilon_i$'s are iid random variables with mean zero and finite variance.
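For concreteness, the following sketch generates synthetic data from this model; the quadratic test function `f_star`, the sample size, and the noise level are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
sigma = 0.1  # assumed noise standard deviation

def f_star(x):
    # A hypothetical convex function standing in for the unknown f*.
    return np.sum(x ** 2, axis=-1)

X = rng.uniform(0.0, 1.0, size=(n, d))          # iid X_i on (0, 1)^d
Y = f_star(X) + sigma * rng.standard_normal(n)  # Y_i = f*(X_i) + eps_i
```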
This situation arises in many practical settings. For example, the long run average waiting time per customer in a single server queue is known to be convex in the service rate (Weber (1983)). Further examples arise in the contexts of economics and queueing systems.
When the only available information is that $f^*$ is convex, a simple approach to estimating $f^*$ is to find the convex function that is closest to the data set $(X_1, Y_1), \dots, (X_n, Y_n)$. In other words, we seek the solution to the following problem:

$$\underset{f}{\text{minimize}} \quad \sum_{i=1}^n \big( Y_i - f(X_i) \big)^2 \quad \text{subject to} \quad f : (0,1)^d \to \mathbb{R} \text{ is convex.} \tag{1}$$
At first glance, (1) is an infinite-dimensional problem. However, there is a convex function $f : (0,1)^d \to \mathbb{R}$ passing through $(X_1, f_1), \dots, (X_n, f_n)$ if and only if there exists a subgradient $\xi_i \in \mathbb{R}^d$ at each $X_i$ for $1 \le i \le n$ satisfying $f_j \ge f_i + \xi_i^T (X_j - X_i)$ for $1 \le j \le n$; see pp. 337-338 of Boyd and Vandenberghe (2004). Using this fact, it can be seen that (1) is equivalent to the following finite-dimensional problem:

$$\text{minimize} \quad \sum_{i=1}^n (Y_i - f_i)^2 \quad \text{subject to} \quad f_j \ge f_i + \xi_i^T (X_j - X_i) \text{ for } 1 \le i, j \le n, \tag{2}$$

where $f_1, \dots, f_n \in \mathbb{R}$ and $\xi_1, \dots, \xi_n \in \mathbb{R}^d$; see Hildreth (1954), Kuosmanen (2009), and Seijo and Sen (2011) for the details. In (2), $f_i$ corresponds to the value of the fitted function at $X_i$, and $\xi_i$ is a subgradient of the fitted function at $X_i$ for $1 \le i \le n$.
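As a minimal sketch of how (2) can be posed to a convex programming solver, the following uses CVXPY, a Python analogue of the CVX package cited later; the function name and solver defaults are our own choices, not the paper's.

```python
import cvxpy as cp

def convex_regression(X, Y):
    """Solve (2): least squares over convex functions, via the
    subgradient characterization f_j >= f_i + xi_i^T (X_j - X_i)."""
    n, d = X.shape
    f = cp.Variable(n)        # fitted values f_i at the X_i's
    xi = cp.Variable((n, d))  # subgradients xi_i at the X_i's
    constraints = [
        f[j] >= f[i] + xi[i] @ (X[j] - X[i])
        for i in range(n) for j in range(n) if i != j
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(Y - f)), constraints)
    prob.solve()
    return f.value, xi.value
```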
Figure 1. The solid line is the unknown function $f^*$ defined by $f^*(x) = 1/x$ for $x \in (0,1)$, the circles are the observations $Y_i = f^*(X_i) + \varepsilon_i$, the dashed line is the convex regression estimator, and the dotted line is the penalized convex regression estimator with $\lambda_n = 0.01$. The $\varepsilon_i$'s follow iid standard normal distributions.

We refer to the solution to (2) as the convex regression estimator, and this estimator has received a great deal of attention from numerous researchers. Hanson and Pledger (1976) established consistency for the case $d = 1$, Groeneboom et al. (2001) computed the rate of convergence for the case $d = 2$, Seijo and Sen (2011) studied consistency for the case $d > 1$, and Mazumder et al. (2019) proposed an efficient algorithm for solving (2). Note that (2) is a convex quadratic program with $n(d+1)$ decision variables and $n^2$ linear constraints, so it can be solved using convex programming solvers.
One drawback of the convex regression estimator is that it tends to overfit the data near the boundary of the domain, so its subgradient becomes large near the boundary. The main reason for this undesirable behavior is that (2) is formulated so that only the sum of squared errors is minimized. Thus, one way to remedy the situation is to add a penalty term to the objective function of (2), which leads to the following formulation:

$$\text{minimize} \quad \sum_{i=1}^n (Y_i - f_i)^2 + \lambda_n \sum_{i=1}^n \|\xi_i\|^2 \quad \text{subject to} \quad f_j \ge f_i + \xi_i^T (X_j - X_i) \text{ for } 1 \le i, j \le n, \tag{3}$$

where $f_1, \dots, f_n \in \mathbb{R}$, $\xi_1, \dots, \xi_n \in \mathbb{R}^d$, $\lambda_n \ge 0$ is the smoothing constant, and $\|(z_1, \dots, z_d)\| = (z_1^2 + \cdots + z_d^2)^{1/2}$ for $(z_1, \dots, z_d) \in \mathbb{R}^d$; see Chen et al. (2020) for the formulation and Bertsimas and Mundru (2020) for an efficient numerical algorithm that solves (3).
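The penalized problem (3) changes only the objective of (2). A minimal sketch under the same assumptions as the previous block:

```python
def penalized_convex_regression(X, Y, lam):
    """Solve (3): least squares plus lam times the sum of squared
    subgradient norms, over the same convexity constraints as (2)."""
    n, d = X.shape
    f = cp.Variable(n)
    xi = cp.Variable((n, d))
    constraints = [
        f[j] >= f[i] + xi[i] @ (X[j] - X[i])
        for i in range(n) for j in range(n) if i != j
    ]
    # cp.sum_squares(xi) sums the squares of all entries of xi,
    # which equals the sum of the squared norms ||xi_i||^2.
    objective = cp.Minimize(cp.sum_squares(Y - f) + lam * cp.sum_squares(xi))
    cp.Problem(objective, constraints).solve()
    return f.value, xi.value
```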
The solution to (3), which we refer to as the "penalized convex regression estimator," exhibits nice numerical behavior, such as bounded subgradients near the boundary of the domain. For example, Figure 1 shows an instance of the penalized convex regression estimator compared to the convex regression estimator. The figure shows how the convex regression estimator overfits the data near the boundary of the domain, forcing the subgradient to be large, whereas the penalized convex regression estimator has bounded subgradients throughout the domain. Thus, when estimating both $f^*$ and its subgradient, one may prefer the penalized convex regression estimator. Despite its appealing numerical performance, however, the statistical foundation of the penalized convex regression estimator has not yet been established.
The goal of this paper is to establish strong consistency of the penalized convex regression estimator and its subgradient uniformly over any compact subset of $(0,1)^d$. Specifically, Theorems 1 and 2 state that the penalized convex regression estimator and its subgradient converge almost surely to $f^*$ and the subgradient of $f^*$, respectively, as $n \to \infty$, uniformly over any compact subset of $(0,1)^d$. This paper is the first to establish the consistency of the penalized convex regression estimator and its subgradient, thereby legitimizing it as an estimator of $f^*$.
In Section 2, we summarize notation and definitions. In Section 3, we describe our main results rigorously. We prove the main results in Section 4. We examine the numerical performance of the penalized convex regression estimator in Section 5. Section 6 contains some concluding remarks.

Notation and Definitions
For $z \in \mathbb{R}^d$, $z^T$ denotes its transpose.
We say $f : (0,1)^d \to \mathbb{R}$ is convex if $f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y)$ for all $x, y \in (0,1)^d$ and $\alpha \in [0,1]$. For a convex function $f : (0,1)^d \to \mathbb{R}$, we say $\xi \in \mathbb{R}^d$ is a subgradient of $f$ at $x \in (0,1)^d$ if $f(y) \ge f(x) + \xi^T (y - x)$ for all $y \in (0,1)^d$. We call the set of all subgradients at $x$ the subdifferential at $x$, and denote it by $\partial f(x)$.

Main Results
To define the penalized convex regression estimator rigorously, we first need to establish the existence of the solution to (3). Proposition 1 asserts that the solution to (3) exists and is unique.
Proof. (3) is a minimization problem of a continuous and coercive function over a non-empty closed domain, so the solution to (3) exists by Proposition 7.3.1 and Theorem 7.3.7 on pp. 216-217 of Kurdila and Zabarankin (2005). The solution is unique because the objective function of (3) is strictly convex. □

It should be noted that (3) computes the penalized convex regression estimator only at the $X_i$'s. More specifically, if $(\hat f_1, \dots, \hat f_n, \hat\xi_1, \dots, \hat\xi_n)$ is the solution to (3), then $\hat f_i$ is the penalized convex regression estimator computed at $X_i$. To define the penalized convex regression estimator at $x \ne X_i$, we set

$$\hat g_n(x) = \max_{1 \le i \le n} \left\{ \hat f_i + \hat\xi_i^T (x - X_i) \right\} \tag{4}$$

for $x \in (0,1)^d$. The penalized convex regression estimator $\hat g_n$, defined by (4), is the subject of study in this paper. In order to establish the consistency of $\hat g_n$ and its subgradient, we need to make the following assumptions.
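Assuming the max-affine form of (4) reconstructed above (the standard extension in this literature), evaluating $\hat g_n$ at an arbitrary point is a one-liner; the function name is our own.

```python
import numpy as np

def g_hat(x, X, f_hat, xi_hat):
    """Evaluate the extension (4): the pointwise maximum of the
    supporting hyperplanes f_i + xi_i^T (x - X_i)."""
    return np.max(f_hat + (xi_hat * (x - X)).sum(axis=1))
```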
Our results are presented below.

Proofs of the Main Results
In this section, we prove Theorems 1 and 2.

Proof of Theorem 1
The proof consists of 10 steps.
Step 1: Since $f^*$ is convex, (5) holds.

Step 2: We next prove that (6) holds for some constant $C$ and $n$ sufficiently large a.s.
Step 3: We use Step 2 and the SLLN to show that, for any subset $A$ of $(0,1)^d$ with a nonempty interior, there exists a constant $C_A$ satisfying (9) a.s. for $n$ sufficiently large. To establish (9), suppose, on the contrary, that (10) holds with positive probability. Then (10) contradicts a bound that holds a.s. for $n$ sufficiently large by (6) and the SLLN. Hence, (9) follows.
Step 4: We use Step 2, Step 3, and the convexity of $\hat g_n$ to show that for any $\delta > 0$, there is a constant $C_\delta$ satisfying (11) a.s. for $n$ sufficiently large. (11) follows from arguments similar to the proofs of Lemmas 3.2 and 3.3 on page 1644 of Seijo and Sen (2011).
Step 7: We utilize (6) and (13) to show that the lim sup bound (14) holds. To fill in the details, let $\epsilon > 0$ be given. By the Cauchy-Schwarz inequality and (6), we obtain (15) for $n$ sufficiently large a.s. By taking $\delta$ small enough that (15) $\le 2\epsilon$, we can ensure (16) for $n$ sufficiently large a.s.
By (20), there is $i \in \{1, \dots, d\}$ such that either (22) or (23) holds for $\xi \in \partial \hat g_n(x_n)$ and infinitely many $n$ with positive probability, where $e_i \in \mathbb{R}^d$ is the vector of zeros except for a 1 in the $i$th entry for $1 \le i \le d$.

Suppose (22) holds. Then there exists a subsequence $x_{n_1}, x_{n_2}, \dots$ such that (24) holds. Note that, by the definition of the subgradient, for any $h > 0$ and $\xi \in \partial \hat g_{n_k}(x_{n_k})$, we have $\hat g_{n_k}(x_{n_k} + h e_i) \ge \hat g_{n_k}(x_{n_k}) + h\, e_i^T \xi$. By the continuity of $\hat g_n$ and Theorem 1, letting $k \to \infty$ on both sides yields the corresponding inequality for $f^*$. Combining this with (22) and (24), and using the continuity of $\nabla f^*(\cdot)$, we obtain (25) for any $h > 0$.

By letting $h \downarrow 0$ in (25), we obtain $e_i^T \nabla f^*(x_0) + \epsilon/d \le e_i^T \nabla f^*(x_0)$, which is a contradiction. Similarly, when we assume that (23) holds, we reach a contradiction. Hence, Theorem 2 is proved.

Empirical Studies
In this section, we compare the numerical behavior of the penalized convex regression estimator to that of the convex regression estimator. In Section 5.1, we assume that $f^* : (0,1)^3 \to \mathbb{R}$ is given by a mathematical formula. In Section 5.2, $f^* : (1.2, 1.7) \to \mathbb{R}$ is the long run average waiting time per customer in an M/M/1 queue. Our findings from the numerical experiments are summarized in Section 5.3.

Simplified Example
We assume that $f^* : (0,1)^3 \to \mathbb{R}$ is given by $f^*(z_1, z_2, z_3) = z_1^2 + 0.5 z_2^2 + 0.2 z_3^2$ for $(z_1, z_2, z_3) \in (0,1)^3$. The $X_i$'s are drawn uniformly from $(0,1)^3$, and the $Y_i$'s are generated via $Y_i = f^*(X_i) + \varepsilon_i$, where the $\varepsilon_i$'s follow the normal distribution with mean 0 and standard deviation 0.1. Once the $(X_i, Y_i)$'s were obtained, we computed the convex regression estimator by solving (2) using CVX (Grant and Boyd (2014)). We also computed the penalized convex regression estimator by solving (3) with $\lambda_n = 1/n$ and $\lambda_n = 1/(2n)$, respectively, using CVX. To evaluate the performance of the penalized convex regression estimator $\hat g_n(\cdot)$, we computed the integrated mean squared error (IMSE) between $f^*$ and the estimator, as well as the IMSE between the gradient of $f^*$ and the subgradient of $\hat g_n$, where the $\hat\xi_i$'s are the subgradients of $\hat g_n$ at the $X_i$'s, computed from (3).
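Since the paper's exact IMSE displays are not reproduced here, the following sketch approximates both IMSEs under assumptions of our own: Monte Carlo integration over the domain for the function IMSE, and averaging over the design points for the gradient IMSE. It reuses `penalized_convex_regression` and `g_hat` from the sketches above.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_star(z):
    return z[..., 0] ** 2 + 0.5 * z[..., 1] ** 2 + 0.2 * z[..., 2] ** 2

n = 100
X = rng.uniform(size=(n, 3))
Y = f_star(X) + 0.1 * rng.standard_normal(n)

f_hat, xi_hat = penalized_convex_regression(X, Y, lam=1.0 / n)

# Monte Carlo approximation of the IMSE between f* and g_n.
Z = rng.uniform(size=(2000, 3))
g_vals = np.array([g_hat(z, X, f_hat, xi_hat) for z in Z])
imse_f = np.mean((g_vals - f_star(Z)) ** 2)

# A crude surrogate for the gradient IMSE, averaged at the X_i's,
# using grad f*(z) = (2 z_1, z_2, 0.4 z_3).
grad_f = np.column_stack([2 * X[:, 0], X[:, 1], 0.4 * X[:, 2]])
imse_grad = np.mean(np.sum((xi_hat - grad_f) ** 2, axis=1))
```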
We then repeated this procedure 400 times independently, generating 400 IMSE values between $f^*$ and $\hat g_n$ and 400 IMSE values between $\nabla f^*$ and the subgradient of $\hat g_n$. Using these values, we computed the 95% confidence interval of the IMSE between $f^*$ and $\hat g_n$ and the 95% confidence interval of the IMSE between the gradient of $f^*$ and the subgradient of $\hat g_n$.
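A minimal way to form such intervals from the 400 replications, assuming the usual normal approximation for the sample mean:

```python
import numpy as np

def ci95(values):
    """95% normal-approximation confidence interval for the mean."""
    m = np.mean(values)
    half = 1.96 * np.std(values, ddof=1) / np.sqrt(len(values))
    return m - half, m + half
```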
The IMSE values between $f^*$ and the convex regression estimator are computed similarly. We report the 95% confidence intervals of the IMSE between $f^*$ and the estimators in Table 1 for a wide range of $n$. We also report the 95% confidence intervals of the IMSE between the gradient of $f^*$ and the subgradient of the estimators in Table 2 for a wide range of $n$.

Table 1. The 95% confidence intervals of the IMSE between $f^*$ and the estimators when $f^*(z_1, z_2, z_3) = z_1^2 + 0.5 z_2^2 + 0.2 z_3^2$ for $(z_1, z_2, z_3) \in (0,1)^3$

Single Server Queue
We assume that $f^*(x)$ is the long run average waiting time per customer in an M/M/1 queue, where the service times follow the exponential distribution with mean $1/x$ for $x \in (1.2, 1.7)$, and the interarrival times follow the exponential distribution with mean 1. The $X_i$'s are drawn uniformly from $(1.2, 1.7)$. For each $X_i$, $Y_i$ is generated by averaging the waiting times of the first 5000 customers in the single server queue, initialized empty and idle, with service rate $X_i$. Once the $(X_i, Y_i)$'s were obtained, we computed the convex regression estimator by solving (2) using CVX. We also computed the penalized convex regression estimator by solving (3) with $\lambda_n = 1/(20n)$ and $\lambda_n = 1/(40n)$, respectively, using CVX.
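A sketch of how such a $Y_i$ can be simulated via the Lindley recursion $W_{k+1} = \max(0,\, W_k + S_k - A_{k+1})$; the function name and seed handling are our own conventions.

```python
import numpy as np

def mm1_avg_wait(service_rate, n_customers=5000, seed=None):
    """Average waiting time of the first n_customers in an M/M/1 queue
    with arrival rate 1, starting empty and idle (Lindley recursion)."""
    rng = np.random.default_rng(seed)
    A = rng.exponential(1.0, n_customers)                 # interarrival times
    S = rng.exponential(1.0 / service_rate, n_customers)  # service times
    W = np.zeros(n_customers)  # W[0] = 0: first customer does not wait
    for k in range(1, n_customers):
        W[k] = max(0.0, W[k - 1] + S[k - 1] - A[k])
    return W.mean()

# Example: Y_i for a service rate X_i drawn from (1.2, 1.7).
# y = mm1_avg_wait(1.5, seed=0)
```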

Observations from Numerical Experiments
Tables 1 and 3 show that both the convex regression estimator and the penalized convex regression estimator converge in terms of the IMSE. On the other hand, Tables 2 and 4 indicate that only the subgradient of the penalized convex regression estimator converges in terms of the IMSE; the subgradient of the convex regression estimator diverges as $n$ increases. This phenomenon is consistent with what we observed in Figure 1: the convex regression estimator has extremely large subgradients near the boundary of the domain, while the penalized convex regression estimator has bounded subgradients throughout the domain.
It is also observed that the numerical performance of the penalized convex regression estimator depends heavily on the smoothing constant $\lambda_n$. Thus, the question of how to select the smoothing constant is a promising topic for future research.

Conclusions
In this paper, we established the consistency of the penalized convex regression estimator. Numerical experiments show that, unlike the convex regression estimator, the penalized convex regression estimator and its subgradient converge near the boundary of the domain. Hence, a promising topic for future research is a thorough examination of the behavior of the penalized convex regression estimator near the boundary of its domain.