Causal Subclassification Tree Algorithm and Robust Causal Effect Estimation via Subclassification

In observational studies, the existence of confounding variables must be attended to, and propensity score weighting methods are often used to eliminate their effects. Although many causal estimators based on propensity scores have been proposed, these estimators generally assume that the propensity scores are properly estimated. However, researchers have found that even a slight misspecification of the propensity score model can bias the estimated treatment effects. Model misspecification problems may occur in practice, and hence, using a robust estimator of the causal effect is recommended. One such estimator is the subclassification estimator. Wang, Zhang, Richardson, and Zhou (2020) presented the conditions necessary for subclassification estimators to be √N-consistent and asymptotically well-defined, and suggested how to construct subclasses.


Introduction
One of the main purposes of observational studies is to estimate the causal effect of an intervention or a treatment on an outcome. The ideal setting for estimating a causal effect is a randomized controlled trial (RCT). However, in practice, treatments or interventions often cannot be assigned randomly. Typical examples are advertisement placement in marketing and smoking status in epidemiology. There are various reasons, ranging from practical to ethical, that make random assignment difficult. Particularly in observational studies, treatments or interventions often depend on covariates because controlling confounding variables is difficult. In cases where the outcome is also affected by covariates, if we do not pay enough attention to the assignment mechanism, the estimated effect of the treatment may be biased.
When we estimate causal effects using the propensity score model, we need to be careful to avoid model misspecification of the propensity score. For example, Kang and Schafer (2007) pointed out the instability of the IPW estimator (Hirano et al., 2003) and the Doubly Robust estimator (Robins, Rotnitzky, & Zhao, 1994) caused by misspecification of the propensity score model. In practice, it may be difficult to get an exact specification of the true propensity score model; however, we need to consider this aspect when using the model.
One popular approach to addressing the model misspecification problem is subclassification, which involves grouping units into K subclasses based on their estimated propensity scores. For example, Rosenbaum and Rubin (1984) recommended using K = 5 subclasses. Unlike the IPW estimator, the ordinary subclassification estimator (cf. Rosenbaum & Rubin, 1983) does not depend on the estimated propensity scores themselves, but only on their rank information. Therefore, the subclassification estimator is robust to propensity score outliers and to slight misspecification of the propensity score model.
However, to use a subclassification estimator in practice, it is necessary to determine a method for subclass construction (i.e., we need to choose the lower and upper bounds of the propensity score in each subclass). A commonly used approach to determining the propensity score cut-off points is to use quantiles of the estimated propensity scores with a fixed number of subclasses, 5 to 10. A problem with this approach is that the length of each subclass does not converge to 0 even as the sample size N goes to infinity, and hence, the subclassification estimator is asymptotically biased. To avoid this problem, Wang, Zhang, Richardson, and Zhou (2020) proposed conditions on the relation between the sample size and the number of subclasses under which the subclassification estimator becomes √N-consistent and asymptotically well defined. They also proposed a theoretical guideline for constructing subclasses, suggesting that each subclass contain at least one observation from each of the treatment and control groups. However, there are two concerns. One is the variance of the subclassification estimator obtained by their suggested method: when the number of samples in each subclass becomes small, the variance of the causal estimator in each subclass may become large, so the subclassification estimator may have a large variance. The other is whether the suggested method satisfies their theoretical guideline.
In this paper, we propose a new subclass construction algorithm that satisfies Wang et al. (2020)'s conditions for the subclassification estimator to be √N-consistent and asymptotically well-defined. Additionally, we show that our algorithm includes their suggested method as a special case, and prove that their method does not satisfy the conditions under which the subclassification estimator becomes asymptotically well-defined.
We call this algorithm the causal subclassification tree. In our algorithm, we use the decision tree of Breiman, Friedman, Stone, and Olshen (1984) with the sum of the variances of the propensity scores, or of the potential outcomes, in the child nodes as the impurity measure. To achieve Wang et al. (2020)'s conditions, we impose constraints on the splitting rules controlled by two parameters (ℓ, α): each split leaves at least a fraction α of the available training examples on each side of the split, and trees are grown fully so that, for some ℓ ∈ N, each terminal node contains between ℓ and 2ℓ − 1 observations from each of the treatment and control groups. The terminal nodes are then used as strata for subclassification. In section 3, we show that the number of leaves generated by our algorithm with ℓ = √N/log(N) satisfies the conditions of Wang et al. (2020). We also propose a method for selecting the parameters (ℓ, α) and extend our subclassification algorithm to multi-class treatments.
Some researchers claim that when propensity scores are estimated by machine learning (ML) methods, such as random forests, boosting, or neural networks, the resulting propensity score weighting estimator can have a larger bias or RMSE than when a logistic regression model is used for propensity score estimation (e.g., Cannas & Arpino, 2019). We confirm this claim in section 4. However, combining the causal subclassification tree algorithm with an ML-based weighting estimator yields relatively small bias and variance compared to the original weighting estimator with ML methods; when the propensity score model is misspecified, the ML-based weighting estimator combined with the causal subclassification tree performs better than weighting estimators based on a logistic regression model or CBPS. This paper is organized as follows. The rest of Section 1 describes the symbols and assumptions used in this paper. Section 2 introduces a general subclassification estimator for the causal effect and the results of Wang et al. (2020) on the √N-consistency of the subclassification estimator. In section 3, we propose the subclassification algorithm and show its asymptotic properties. We then extend our algorithm to multi-class treatments and to the estimation of weighted average treatment effects. In section 4, we confirm through simulation, using Kang and Schafer (2007)'s settings, that the weighting estimators for causal effects become robust when our methods are applied. We end with a discussion of the findings in section 5.

Notation for Binary Treatment
Let us consider a random sample of N observations from a population P. For each unit i, we observe a binary treatment variable T_i ∈ {0, 1}, a p-dimensional column vector of observed pre-treatment covariates X_i, whose support is denoted by X, and an outcome Y_i ∈ R. The propensity score is defined as the conditional probability of receiving the treatment given the covariates, π(X_i) = Pr(T_i = 1 | X_i). Following Rosenbaum and Rubin (1983), we assume that the true propensity score is bounded away from 0 and 1: 0 < π(X_i) < 1 for all X_i ∈ X. This assumption means that every individual in the data has a non-zero probability of receiving or not receiving the treatment.
To estimate causal effects from the data, additional assumptions must be made about the data-generating process. We assume the unconfoundedness assumption, (Y_i(0), Y_i(1)) ⊥ T_i | X_i, where Y_i(t) is the potential outcome corresponding to treatment assignment t ∈ {0, 1}.
Under the unconfoundedness assumption, Rosenbaum and Rubin (1983) showed that treatment assignment is ignorable given the (true) propensity score π(X i ).
Consequently, the average treatment effect can be written as the difference between the expected outcomes of the treatment and control groups conditional on the propensity score. In practice, the propensity score is a continuous quantity on (0, 1), and there is no guarantee that there will be samples with the same propensity score. In general, the following three estimators are often used. The first is a matching method that measures the closeness of the propensity scores between samples using an appropriate distance or kernel function and compares samples that are close. The second is the Inverse Probability Weighting estimator (IPW estimator; Hirano et al., 2003), which weights samples by the reciprocal of the propensity score. The last is a stratification method that divides the data into 5 to 10 subclasses based on the propensity score and then compares them subclass by subclass.

Subclassification Estimator
Subclassification estimators are constructed in two steps. First, the observed data are divided into subclasses based on propensity scores, and then the averages of outcomes of treatment and control groups belonging to the same subclass are compared (Rosenbaum & Rubin, 1984).
Let π̂_min and π̂_max be the minimum and maximum of the estimated propensity scores, respectively, and let q̂_k, k = 0, 1, ..., K, be values such that π̂_min = q̂_0 < q̂_1 < ... < q̂_{K−1} < q̂_K = π̂_max, where K is the number of subclasses. The sets Ĉ_k = [q̂_{k−1}, q̂_k), k = 1, 2, ..., K − 1, and Ĉ_K = [q̂_{K−1}, q̂_K] form a partition of the interval [π̂_min, π̂_max]. Let N_k be the number of units in subclass k and N_tk the number of subjects from treatment group t in subclass k, for t = 0, 1 and k = 1, ..., K; that is, N_k = Σ_{i=1}^{N} I(π̂(X_i) ∈ Ĉ_k) and N_tk = Σ_{i=1}^{N} I(π̂(X_i) ∈ Ĉ_k) I(T_i = t). The subclassification estimator for the average treatment effect, τ̂_S, is then the weighted average of within-subclass mean differences, τ̂_S = Σ_{k=1}^{K} (N_k/N)(Ȳ_{1k} − Ȳ_{0k}), where Ȳ_{tk} is the average outcome of group t in subclass k. For subclassification estimators, setting K = 5 is recommended in Rosenbaum and Rubin (1984), and the quantiles of the estimated propensity scores are used as the K cut-off points. However, as is well known, if K is fixed, τ̂_S may be a biased estimator of τ (e.g., Lunceford & Davidian, 2004). Wang et al. (2020) proposed increasing K with the sample size N to avoid the bias that may occur when K is fixed; that is, the number of subclasses K = K(N) is set to satisfy conditions (2). They also proposed the hybrid estimator τ̂_H and showed that τ̂_H is a √N-consistent estimator of τ under conditions (2). Moreover, they showed that τ̂_H is asymptotically well defined if K(N) log(K(N))/N → 0 (condition (4)).
http://ijsp.ccsenet.org International Journal of Statistics and Probability Vol. 10, No. 1; 2021
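The quantile-based estimator τ̂_S described above can be sketched in a few lines. This is a minimal illustration, assuming quantile cut-off points and skipping any subclass with an empty treatment or control arm; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def subclassification_ate(y, t, ps_hat, K=5):
    """Quantile-based subclassification estimate of the ATE.

    Units are grouped into K subclasses by quantiles of the estimated
    propensity score, and within-subclass mean differences are averaged
    with weights N_k / N.
    """
    y, t, ps_hat = map(np.asarray, (y, t, ps_hat))
    N = len(y)
    # Cut-off points q_0 < q_1 < ... < q_K from quantiles of the scores.
    cuts = np.quantile(ps_hat, np.linspace(0.0, 1.0, K + 1))
    # Assign each unit to a subclass; the last interval is closed.
    idx = np.clip(np.searchsorted(cuts, ps_hat, side="right") - 1, 0, K - 1)
    tau = 0.0
    for k in range(K):
        in_k = idx == k
        n_k = in_k.sum()
        y1 = y[in_k & (t == 1)]
        y0 = y[in_k & (t == 0)]
        if len(y1) == 0 or len(y0) == 0:
            continue  # the estimator is undefined for an empty arm
        tau += (n_k / N) * (y1.mean() - y0.mean())
    return tau
```

With a constant treatment effect and scores independent of treatment, the estimator recovers the effect exactly, since every within-subclass difference equals the effect.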
Thus, for τ̂_H to be asymptotically well defined, the number of subclasses must grow slowly enough that every subclass contains at least one observation from each of the treatment and control groups.
Wang et al. (2020) proposed using the maximal number of subclasses such that the hybrid estimator is well defined, K_max = max{K : τ̂_H is well defined}, and called the hybrid estimator with K_max the full subclassification estimator, τ̂_FS. Although they provided a theoretical guideline for choosing K, they did not give a practical procedure for finding K_max, but suggested using the largest K such that all subclasses Ĉ_1, ..., Ĉ_K contain at least one observation from each treatment group. However, does their suggested procedure satisfy conditions (2) and (4)?
In the following sections, we first propose a new algorithm, controlled by two parameters, to generate subclasses, and derive conditions on its parameters under which the number of subclasses K(N) generated by the proposed algorithm satisfies conditions (2) and (4). We then point out that Wang et al. (2020)'s suggested procedure is a special case of our algorithm and show that the resulting subclassification estimator with their suggested procedure does not satisfy condition (4).

Causal Subclassification Tree
In this section, we propose an algorithm named causal subclassification tree, which generates subclasses C 1 , ..., C K automatically from estimated propensity scores and potential outcomes. We evaluate upper and lower bounds of the number of subclasses K(N) generated by our algorithm, and show such K(N) satisfies conditions (2) and (4). Then, we discuss the robustness of the subclassification estimator with our algorithm in section 3.3, and explain the advantages of our algorithms in practical situations. The parameter tuning procedures and the extension of our algorithm to multi-class treatment are described at the end of this section.

Algorithm of Causal Subclassification Tree
Let the parent node be P and the child nodes generated from P be C and C′. Further, let (N_0P, N_1P), (N_0C, N_1C), and (N_0C′, N_1C′) be the numbers of control- and treatment-group samples in the parent node P and the child nodes C and C′, respectively. Splitting a node is then controlled by the parameters (ℓ, α) as follows: for a given ℓ ∈ N and α ∈ (0, 0.5], the causal subclassification tree splits the parent node P into C and C′ only when conditions (6) and (7) are satisfied. Condition (6) controls the minimum leaf size of the tree so that each leaf contains at least ℓ units of each group, and condition (7) controls the ratio of units that the child nodes C and C′ receive from the parent node P. A moderate value of α prevents extremely imbalanced child nodes. Next, we describe our splitting rules. Let the estimated propensity score of sample i be π̂_i. For a split of node P into C and C′, we define the splitting criterion Δ_1(C, C′) as follows.
Here, π̄_C and π̄_C′ are the averages of the estimated propensity scores in the child nodes C and C′. At node P, we choose the split that minimizes Δ_1(C, C′) among all the splits satisfying conditions (6) and (7). Nodes are split recursively until no split satisfies conditions (6) and (7), i.e., until each leaf contains between ℓ and 2ℓ − 1 samples of each class. The procedure of the causal subclassification tree is as follows.
Procedure 1. Input: estimated propensity scores π̂_i, minimum leaf size ℓ, sample fraction α.
1. Find the split of the node that minimizes Δ_1(C, C′) among the splits satisfying conditions (6) and (7), and apply this split to the node. Repeat this step recursively until no split satisfies (6) and (7).
2. Use the terminal nodes of the tree as subclasses Ĉ_1, ..., Ĉ_K.
3. Compute the subclassification estimator τ̂_H with equation (3).
Criterion (8) is the sum of the variances of the estimated propensity scores in the child nodes C and C′. Instead of this criterion, we can also use the sum of squared prediction errors, the Gini coefficient, or the information gain.
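A single constrained split of the kind used in step 1 can be sketched as follows. This is an assumption-laden illustration: Δ_1 is implemented here as the summed within-child variance of the estimated propensity scores, candidate thresholds are midpoints between adjacent sorted scores, and `leaf_min` and `alpha` play the roles of ℓ and α; all names are illustrative.

```python
import numpy as np

def best_split(ps_hat, t, leaf_min=5, alpha=0.2):
    """Greedy search for one admissible split of a node.

    A split is admissible only if (6) each child keeps at least
    `leaf_min` units of each treatment arm, and (7) each child keeps
    at least an alpha fraction of the node's units.  Returns the
    (criterion value, threshold) pair of the best split, or None
    when no admissible split exists.
    """
    order = np.argsort(ps_hat)
    ps, tt = np.asarray(ps_hat)[order], np.asarray(t)[order]
    n = len(ps)
    best = None
    for i in range(1, n):
        # Condition (7): neither child may be too small a fraction.
        if i < alpha * n or (n - i) < alpha * n:
            continue
        left, right = slice(0, i), slice(i, n)
        # Condition (6): leaf_min units of both arms in both children.
        if min((tt[left] == 1).sum(), (tt[left] == 0).sum(),
               (tt[right] == 1).sum(), (tt[right] == 0).sum()) < leaf_min:
            continue
        # Delta_1 as the summed within-child variance of the scores.
        delta = ps[left].var() + ps[right].var()
        if best is None or delta < best[0]:
            best = (delta, (ps[i - 1] + ps[i]) / 2.0)
    return best
```

On scores forming two well-separated clusters, the minimizing threshold falls between the clusters, since any within-cluster split leaves one child with a large variance.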
We can use the following splitting criterion Δ_2(C, C′) instead of Δ_1(C, C′), which uses the potential outcomes Y(1) and Y(0): it replaces the estimated propensity scores in Δ_1 with the observed outcomes, using the averages Ȳ_C and Ȳ_C′ of the observed potential outcomes in the child nodes C and C′. Athey and Imbens (2016) pointed out that, when the splits and the outcomes are not independent, the subclassification estimator may be biased. This problem can be avoided by applying the double-sample tree (Wager & Athey, 2018) to the procedure. A double-sample tree is a technique to achieve honesty (Wager & Athey, 2018) by dividing the training subsample into two halves, I and J. It uses the J-sample to place the splits, while holding out the I-sample for the within-leaf estimation.
Procedure 2. 1. Divide the training sample into two halves, I and J.
2. Find the split of the node that minimizes Δ_2(C, C′) among the splits satisfying conditions (6) and (7), and apply this split using all observations from the J-data and the X- or T-observations from the I-data, but without the Y-observations from the I-data. Repeat this step recursively until no split satisfies (6) and (7) for the I-data.
Remark 1. Here, ℓ and α are parameters for controlling the tree so that extreme divisions are not generated. These parameters are also used by Wager and Walther (2015) and in the causal forest (Wager & Athey, 2018) to generate α-regular trees.

Properties of the Numbers of Subclasses Generated by Causal Subclassification Tree
In this section, we discuss the asymptotic behavior of the number of subclasses K(N) generated by the causal subclassification tree. The following theorem shows its upper and lower bounds.
Theorem 1. The number of subclasses K(N) generated by the causal subclassification tree with parameters (ℓ, α) satisfies inequality (9), where N is the number of observations and N_1 and N_0 are the numbers of observations in the treatment and control groups, respectively; thus N_1 + N_0 = N.
Proof. Let d_min and d_max be the minimum and maximum depths of the causal subclassification tree with sample sizes (N_1, N_0) and parameters (ℓ, α). Because a child node contains at least 100α% of the units of its parent node by condition (7), an upper bound on d_max follows. Additionally, because a child node contains at most 100(1 − α)% of the units of its parent node, a lower bound on d_min follows. Combining these yields bounds on the depths, where the logarithm has base 2 because one split generates two leaves. The number of terminal nodes of the tree is larger than 2^{d_min} and smaller than 2^{d_max}; therefore, inequality (9) holds.
Theorem 2. For α ∈ (0, 0.5] and ℓ = √N/log(N), the number of subclasses K(N) generated by the causal subclassification tree satisfies conditions (2) and (4).
Proof. The lower and upper bounds in (9) can both be shown to be of the required order; thus, by inequality (9), conditions (2) and (4) hold.
According to Theorem 2, if we set ℓ = √N/log(N), then K(N) satisfies conditions (2) and (4). Therefore, the subclassification estimator obtained by the causal subclassification tree is a √N-consistent estimator of τ and is asymptotically well defined.

Properties of Causal Subclassification Tree
In this section, we first explain the connection between the causal subclassification tree algorithm and Wang et al. (2020)'s strategy: using the largest K such that all subclasses Ĉ_1, ..., Ĉ_K contain at least one observation from each treatment group. Their strategy can be considered a special case of our algorithm with a small enough parameter α and ℓ = 1. When we set ℓ = 1, each subclass contains exactly one observation from each of the treatment and control groups, so this is one possible implementation of their strategy. However, if we set ℓ = 1 for any sample size N, then, because the lower and upper limits of the number of subclasses K(N) are given by inequality (9), K(N) does not satisfy condition (4), and the resulting estimator is not guaranteed to be asymptotically well-defined.
Although the subclassification estimator with fixed ℓ does not satisfy condition (4), if we increase ℓ at the rate √N/log(N), the resulting subclassification estimator becomes asymptotically well-defined. We choose ℓ by cross-validation to minimize the loss function introduced in section 3.5, searching over the range given there. By Theorems 1 and 2, the subclassification estimator obtained by the causal subclassification tree algorithm with this strategy satisfies conditions (2) and (4); therefore, it is √N-consistent and asymptotically well defined.
We emphasize here that, unlike conventional methods that automatically choose the cut-off points of the propensity score estimates based on quantiles, the causal subclassification tree chooses cut-off points that minimize a criterion such as Δ_1 or Δ_2. The choice of cut-off points affects the performance of the resulting subclassification estimator, and we would like to control it as much as we can by defining a criterion that reflects our notion of "good performance" and choosing the cut-off points accordingly. For example, if we use quantiles of the estimated propensity score, it is difficult to explain why we use those particular cut-off points, or to answer whether better cut-off points exist. For these practical reasons, data-driven cut-off point determination is important in data analysis.
When we use the causal subclassification tree, a criterion function Δ must be specified, and the resulting subclasses satisfy the properties induced by Δ. For example, if we use Δ_1 as the criterion function, we obtain subclasses with a small variance of the estimated propensity scores within each subclass; if we use Δ_2, subclasses with a small variance of the observed outcomes within each subclass are generated. In addition, if we use Δ_3 as the criterion function, we can generate subclasses that minimize the likelihood of the distribution of the treatment variables T. Therefore, one of the attractive properties of the causal subclassification tree is that researchers can deliberately choose the criterion function so that the subclasses satisfy the desired properties.

Subclassification Estimator for Weighted Average Treatment Effect
In practical situations, we may want to estimate not only the average treatment effect but also weighted average treatment effects. For example, we are often interested in the average treatment effect on the treated (ATT), the average treatment effect on the untreated (ATU), or the average treatment effect on the overlap (ATO; Li, Morgan, & Zaslavsky, 2018). In this section, we introduce subclassification estimators for weighted average treatment effects. The IPW estimator for the ATT, E[Y_i(1) − Y_i(0) | T_i = 1], weights the control units by the odds π̂_i/(1 − π̂_i). The subclassification estimator for the ATT can be obtained by replacing 1/π̂_i with the subclassification weight 1/p̂_i; it takes the form τ̂_ATT = Σ_{k=1}^{K} (N_{1k}/N_1)(Ȳ_{1k} − Ȳ_{0k}), where Ȳ_{1k} − Ȳ_{0k} is the difference in the average outcomes of the treatment and control groups in subclass Ĉ_k. Therefore, the subclassification estimator for the ATT can be considered a weighted mean of the subclass-level causal effects, with weights proportional to the treatment-group sample sizes. This shows that a subclass with a high probability of receiving the treatment is assigned a large weight.
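The weighted mean of subclass-level effects with treated-share weights N_1k/N_1, as described above, can be sketched as follows; the subclass labels are assumed to have already been produced by some tree or quantile rule, and the names are illustrative.

```python
import numpy as np

def subclass_att(y, t, subclass):
    """Subclassification estimate of the ATT: a weighted mean of
    within-subclass mean differences, with weights N_1k / N_1 (the
    treated share of each subclass).  Subclasses with an empty arm
    contribute nothing, mirroring the well-definedness issue in the
    text."""
    y, t, subclass = map(np.asarray, (y, t, subclass))
    n1 = (t == 1).sum()
    att = 0.0
    for k in np.unique(subclass):
        in_k = subclass == k
        y1, y0 = y[in_k & (t == 1)], y[in_k & (t == 0)]
        if len(y1) == 0 or len(y0) == 0:
            continue  # undefined subclass is skipped
        att += (len(y1) / n1) * (y1.mean() - y0.mean())
    return att
```

For instance, with two subclasses holding (2 treated, 1 control) and (1 treated, 2 controls), the two subclass effects are averaged with weights 2/3 and 1/3.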
Similarly, the subclassification estimator for the average treatment effect on the untreated (ATU) weights the subclass-level effects by the control-group shares N_{0k}/N_0, and a subclassification estimator for the average treatment effect on the overlap (ATO; Li et al., 2018) can be defined analogously with overlap-type weights.

A Tuning Algorithm for the Parameters (ℓ, α)
We discuss three parameter-tuning methods for (ℓ, α). All of them use cross-validation, and to choose ℓ and α such that K(N) satisfies conditions (2) and (4), we explore the range [√(min{N_1, N_0}) / (2 log(min{N_1, N_0})), 3√(max{N_1, N_0}) / (2 log(max{N_1, N_0}))] for the parameter ℓ and (0, 0.5] for the parameter α, where N_1 and N_0 are the sample sizes of the treatment and control groups. For example, when the sample size is N = 1000 with (N_0, N_1) = (200, 800), we explore [2, 6] as candidates for ℓ. The first method finds the minimizer of the cross-validation prediction error of the causal subclassification tree. The second method finds the minimizer of a measure of covariate imbalance, for example, Imai and Ratkovic (2014)'s imbalance measure or the one of Wang and Zubizarreta (2019), where the w_i are the inverses of the estimated propensity scores. These two approaches are intuitive. However, it is well known that if the model for the propensity score contains all variables that affect the potential outcomes, including confounders, then the mean squared error of the IPW estimator becomes smaller than when the propensity score model contains only confounding variables (Brookhart et al., 2006). Hence, we consider a measure Γ that uses the variance of the potential outcomes, Γ = Σ_{k=1}^{K} Γ_k, where Γ_k is the sum of the variances of the observed outcomes of each group in subclass C_k. We find the parameter pair (ℓ, α) that minimizes Γ by cross-validation.
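The candidate range for ℓ can be computed directly. The sketch below assumes natural logarithms and integer rounding (ceiling for the lower end, floor for the upper), which reproduces the [2, 6] example for (N_0, N_1) = (200, 800); the function name is illustrative.

```python
import math

def ell_candidate_range(n_treated, n_control):
    """Candidate search interval for the minimum leaf size parameter:
    [ sqrt(n_min) / (2 log n_min),  3 sqrt(n_max) / (2 log n_max) ],
    rounded to integers, with n_min/n_max the smaller/larger group size."""
    n_min = min(n_treated, n_control)
    n_max = max(n_treated, n_control)
    lo = math.ceil(math.sqrt(n_min) / (2 * math.log(n_min)))
    hi = math.floor(3 * math.sqrt(n_max) / (2 * math.log(n_max)))
    return lo, hi
```

For (N_0, N_1) = (200, 800): √200/(2 ln 200) ≈ 1.33 and 3√800/(2 ln 800) ≈ 6.35, giving candidates 2 through 6.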

Extension to Multi-class Treatment Regimes
We extend the causal subclassification tree to causal inference with multi-class treatments. Let T_i be a treatment that takes one of K integer values, i.e., T_i ∈ T = {0, 1, 2, ..., K − 1}, where K ≥ 2, and let Y_i(t), t = 0, 1, ..., K − 1, be the potential outcomes corresponding to treatment t. Following Imbens (2000), we define the generalized propensity score as π_k(X_i) = Pr(T_i = k | X_i), where all conditional probabilities sum to 1, i.e., Σ_{k=0}^{K−1} π_k(X_i) = 1. We may use a multinomial logistic regression model to estimate the generalized propensity scores.
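Given multinomial-logit coefficients, the generalized propensity scores follow from a softmax over the K linear predictors; the coefficient matrix below is an illustrative stand-in for fitted multinomial regression estimates, not output from any specific fitting routine.

```python
import numpy as np

def generalized_ps(X, betas):
    """Generalized propensity scores via a multinomial-logit link:
    pi_k(x) = exp(x . beta_k) / sum_j exp(x . beta_j).

    `X` has shape (n, p); `betas` has shape (p, K), one column per
    treatment level.  Rows of the result sum to 1 by construction."""
    logits = X @ betas
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)
```

With all coefficients equal to zero, every treatment level receives probability 1/K, and in general each row sums to one, matching the constraint above.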
In causal inference with multi-class treatment regimes, we are interested in the difference in effects between two different treatments t and t′, τ(t, t′) = E[Y_i(t) − Y_i(t′)]. An IPW estimator for τ(t, t′) was proposed by Feng, Zhou, Zou, Fan, and Li (2012). However, this estimator becomes unstable when there is some misspecification in the propensity score model. Thus, we propose a new subclassification method that extends the causal subclassification tree algorithm to multi-class treatment regimes.
Let π̂_i = (π̂_i^0, π̂_i^1, ..., π̂_i^{K−1})^T be the vector of estimated generalized propensity scores. We can extend the causal subclassification tree to multi-class treatment regimes simply by replacing Δ_1 in (8) with a criterion Δ_3, which applies the same variance computation to the vectors of estimated generalized propensity scores. Then, just as in the binary treatment case, we define weights for each treatment t as follows.
Here, N_k is the number of units in subclass C_k, and N_tk is the number of units that received treatment t in subclass C_k. A subclassification estimator for τ(t, t′) is defined by replacing 1/π̂_i^t and 1/π̂_i^{t′} with w_it and w_it′, respectively.
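A sketch of the resulting multi-class contrast follows, under two stated assumptions that are our reading rather than necessarily the paper's exact estimator: the within-subclass weight for treatment level t is taken as N_k/N_tk (the inverse of that level's within-subclass share, mirroring the binary case), and the two weighted arm means are normalized Hájek-style.

```python
import numpy as np

def multiclass_subclass_effect(y, t, subclass, t1, t2):
    """Subclassification contrast tau(t1, t2) for a multi-class
    treatment.  Each unit with T = t in subclass C_k gets weight
    N_k / N_tk; the contrast is the difference of the normalized
    weighted means of the two arms."""
    y, t, subclass = map(np.asarray, (y, t, subclass))
    w = np.zeros(len(y))
    for k in np.unique(subclass):
        in_k = subclass == k
        n_k = in_k.sum()
        for lvl in (t1, t2):
            arm = in_k & (t == lvl)
            if arm.sum() > 0:
                w[arm] = n_k / arm.sum()
    a1, a2 = (t == t1), (t == t2)
    return (np.sum(w[a1] * y[a1]) / np.sum(w[a1])
            - np.sum(w[a2] * y[a2]) / np.sum(w[a2]))
```

Units receiving treatment levels other than t1 and t2 simply do not enter the contrast, though they still count toward the subclass sizes N_k.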

Simulation Studies
We apply the causal subclassification tree to simulation data in almost the same settings as Kang and Schafer (2007), where the propensity score models are slightly misspecified. In that controversial paper, Kang and Schafer (2007) conducted a set of simulation experiments to study the performance of propensity score weighting methods. They found that misspecification of the propensity score model can affect the performance of various weighting methods. In particular, they showed that although the doubly robust estimator of Robins et al. (1994) provides a consistent estimate of the treatment effect if either the outcome model or the propensity score model is correct, its performance can deteriorate when both models are slightly misspecified. In this section, we describe the simulation experiment performed with data generated by Kang and Schafer (2007)'s model. We then examine whether the subclassification estimator with the causal subclassification tree can improve the empirical performance of the propensity score weighting estimators and of the full subclassification estimator constructed by Wang et al. (2020)'s procedure.
Our data-generating process is as follows. There are four pre-treatment covariates X_i = (X_i1, X_i2, X_i3, X_i4), each of which is independently and identically distributed as N(0, 1). The true outcome model is a linear regression on these covariates with an independently and identically distributed standard normal error term. The mean outcome of the treated observations is 210, which is the quantity of interest to be estimated.
The true propensity score model is a logistic regression with X_i as the linear predictor, such that the mean probability of receiving the treatment equals 0.5. Only non-linear transforms of the covariates, denoted Z_i, are observed. In this simulation, three propensity score weighting estimators are investigated: the Horvitz-Thompson estimator (HT; Horvitz & Thompson, 1952), the inverse propensity score weighting estimator (IPW; Hirano et al., 2003), and the doubly robust estimator (DR; Robins et al., 1994). The true model for the propensity score is a logistic regression with X_i as covariates; thus, models with Z_i as covariates are misspecified for these data. Likewise, the true outcome model is a linear regression with X_i as covariates, and outcome models with Z_i as covariates are misspecified. However, only the DR estimator uses an outcome model, because outcome models need not be considered for the HT and IPW estimators. When we use causal subclassification weights (CSW) w_i instead of the estimated propensity scores π̂_i in each estimator, we denote them HT-CSW, IPW-CSW, and DR-CSW.
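The data-generating process can be sketched as below. The coefficients and covariate transforms are taken as they are commonly reproduced from Kang and Schafer (2007) and should be verified against the original paper; the unit-variance error and the function name are assumptions of this sketch. Under it, the marginal mean of Y is 210 and the marginal treatment probability is 0.5.

```python
import numpy as np

def kang_schafer(n, rng):
    """Sketch of the Kang and Schafer (2007) data-generating process.

    Returns outcome y, treatment t, observed (transformed) covariates z,
    and true propensity p.  Coefficients as commonly reproduced from the
    paper; check against the original source."""
    x = rng.standard_normal((n, 4))
    y = 210 + x @ np.array([27.4, 13.7, 13.7, 13.7]) + rng.standard_normal(n)
    lin = -x[:, 0] + 0.5 * x[:, 1] - 0.25 * x[:, 2] - 0.1 * x[:, 3]
    p = 1.0 / (1.0 + np.exp(-lin))
    t = (rng.uniform(size=n) < p).astype(int)
    # Only non-linear transforms of the covariates are observed.
    z = np.column_stack([
        np.exp(x[:, 0] / 2),
        x[:, 1] / (1 + np.exp(x[:, 0])) + 10,
        (x[:, 0] * x[:, 2] / 25 + 0.6) ** 3,
        (x[:, 1] + x[:, 3] + 20) ** 2,
    ])
    return y, t, z, p
```

Fitting any model on z rather than x then reproduces the misspecified-model scenarios described in the text.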
In addition, when we use full subclassification weights (FSW), we denote the estimators HT-FSW, IPW-FSW, and DR-FSW. To obtain the full subclassification weights, following the discussion in section 3.3, we use a causal subclassification tree with parameter ℓ = 1. To estimate the propensity score, Kang and Schafer (2007) used logistic regression with Z_i as predictors, i.e., π(X_i) = logit^{−1}(Z_i^T β); that is, the model was misspecified because the true propensity score model is a logistic regression with X_i as predictors. In our simulation, in addition to logistic regression, we use the Covariate Balancing Propensity Score (CBPS; Imai & Ratkovic, 2014), Random Forest (Breiman, 2001), and XgBoost (Chen & Guestrin, 2016) to estimate the propensity score. We use the same propensity and outcome model specifications as Kang and Schafer and investigate whether using the causal subclassification weights p̂_i instead of the estimates π̂_i from LR, CBPS, RF, and XgBoost improves the empirical performance of these estimators. For each propensity score estimated by these methods, we computed causal subclassification weights using the causal subclassification tree (Procedure 2). To train each tree, we used the parameter-tuning algorithm of subsection 3.5, which uses the potential outcomes. In other words, our simulation study examines how replacing the reciprocal of the estimated propensity score with causal subclassification weights improves the empirical performance of the three commonly used weighting estimators. As in Kang and Schafer's study, we conducted simulations under the following four scenarios: (a) both the propensity score and outcome models are correctly specified, (b) only the propensity score model is correct, (c) only the outcome model is correct, and (d) both models are misspecified.
For each scenario, we conducted 1000 Monte Carlo simulations with a sample size of 1000 and computed the bias, variance, and root mean squared error (RMSE) of each estimator. The results are presented in Tables 1, 2, and 3. The HT and IPW estimators do not depend on outcome models, and hence the corresponding cells in each table are left blank.
As discussed in section 3.3, HT-CSW and IPW-CSW are the same estimators, and hence, we do not write the result of HT-CSW in each table. For each scenario, we computed the bias, variance, and RMSE for each weighting estimator on the basis of five different estimation methods for the propensity score: (i) the standard logistic regression with X i being the linear predictor as in the original simulation study (LR), (ii) the just-identified CBPS (Imai & Ratkovic, 2014) estimation with the covariate balancing moment conditions with respect to X i and without the score condition (CBPS(exact)), (iii) the overidentified CBPS estimation (Imai & Ratkovic, 2014) with both covariate balancing and score conditions (CBPS(over)), (iv) the Random Forest with X i as predictors (RF), and (v) the XgBoost with X i as predictors (XgBoost).
In the first scenario, (a), where both models are correct, the HT, IPW, and DR estimators have relatively low bias when LR or CBPS is used as the propensity score model, regardless of whether we use the causal subclassification tree. However, when RF or XgBoost is used, the HT and IPW estimators have larger absolute bias than their CSW and FSW counterparts. Each subclassification estimator has about the same or lower variance compared to the corresponding estimator using the estimated propensity scores themselves. Comparing CSW and FSW, both have almost the same absolute bias; however, the variances of the FSW estimators are twice as large as those of CSW. As a result, the RMSEs of the estimators with causal subclassification weights are smaller than those with the original weights, except for CBPS(exact), especially when RF or XgBoost is used, and also smaller than those with full subclassification weights. DR is not sensitive to the choice of propensity score estimation method because of its double robustness.
The second scenario, (b), shows the performance of the estimators when the propensity score model is correct but the outcome model is misspecified. The results for HT and IPW are the same as in scenario (a) because these estimators depend only on the propensity score model. For the DR estimator, DR-CSW and DR-FSW have biases similar to DR but smaller variances when LR and CBPS are used for the propensity score model. In contrast, when RF and XgBoost are used, DR-CSW has a smaller bias but a relatively large variance. As a consequence, the RMSE of all estimators with causal subclassification weights are smaller than those using the estimated propensity scores directly. Comparing DR-CSW and DR-FSW, both have almost the same absolute bias; however, the variances of the FSW estimators are roughly twice those of CSW. As a result, the RMSE of the DR-CSW estimator is smaller than that of DR-FSW in every case.
The third scenario, (c), is the situation where the propensity score model is misspecified whereas the outcome model for the DR estimator is correct. Because the HT and IPW estimators rely only on the propensity scores, they may have large bias and/or variance. For LR, IPW-CSW and IPW-FSW have larger biases but smaller variances than the IPW estimator (for the HT estimator, HT-CSW and HT-FSW have smaller biases and variances), and the RMSE of IPW-CSW and IPW-FSW are smaller than those of HT and IPW. For both CBPS methods, IPW-CSW has larger bias and variance than IPW. With RF and XgBoost, IPW-CSW has about the same or smaller bias and variance than HT and IPW, respectively. In short, when LR, RF, and XgBoost are used, HT-CSW and IPW-CSW have relatively smaller RMSE, but when either CBPS is used, HT-CSW and IPW-CSW have larger RMSE. This interesting phenomenon arises because the subclassification estimator with our weights is consistent when the ordering of the propensity scores is correctly specified (by the properties of regression trees; Wager & Walther, 2015). LR, RF, and XgBoost are (non-linear) prediction models for W, but CBPS is not: it finds the solution of the balancing equations for the covariates of the treatment and control groups. Hence, the performance of the HT and IPW estimators with LR, RF, and XgBoost improves with CSW, but that of the estimators with CBPS does not. Comparing the CSW and FSW subclassification estimators (other than DR) in scenario (c), the result is, in short, similar to scenarios (a) and (b): both estimators have similar absolute biases, and FSW has larger variances than CSW.
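The ordering argument can be made concrete: a tree split on estimated propensity scores depends only on their ranks, so any monotone transformation of the scores produces the same subclasses. The sketch below uses a simple quantile split as a stand-in for the tree's chosen cut-offs (the function is illustrative, not the paper's algorithm):

```python
def subclasses_by_quantile(scores, n_sub=2):
    """Assign each unit to a subclass determined only by the rank of its
    estimated propensity score, as a split on the score would.
    Returns a list of subclass labels, one per unit, in input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    size = len(scores) / n_sub  # target units per subclass
    for rank, i in enumerate(order):
        labels[i] = min(int(rank / size), n_sub - 1)
    return labels

scores = [0.10, 0.35, 0.60, 0.90]
warped = [s ** 3 for s in scores]  # monotone transformation: same ordering
# Identical subclasses, even though the warped "scores" are badly miscalibrated:
assert subclasses_by_quantile(scores) == subclasses_by_quantile(warped)
```

This is why a propensity score model that is misspecified in scale but correct in ordering (as LR, RF, and XgBoost can be here) still yields well-behaved subclassification estimators, whereas CBPS, which targets balance rather than prediction, gains nothing from subclassification.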
In the final scenario, (d), the performance of the DR estimator deteriorates because both the propensity score and the outcome models are misspecified. The bias and RMSE of DR are the largest among all scenarios. However, for all propensity score estimation methods, DR-CSW and DR-FSW have relatively smaller biases and RMSE than DR. Both DR and DR-CSW with RF or XgBoost have small RMSE compared to the other cases; in particular, DR-CSW with the machine learning methods achieves the smallest RMSE among the propensity score models. As in the previous scenarios, comparing the biases and variances of the CSW and FSW subclassification estimators yields the same result: both have similar absolute biases, and FSW has a larger variance than CSW.
Summarizing the above discussion, the proposed weights tend to improve the performance of all estimators over those using the original estimated propensity scores, and they show robustness to model misspecification. In particular, with RF and XgBoost, all estimators with causal subclassification weights show dramatically better bias, variance, and RMSE than those without CSW. If the propensity score estimates are used directly, machine learning methods such as RF and XgBoost are not recommended; however, combined with our proposed algorithm, the causal subclassification tree, these machine learning methods outperform LR and the CBPS methods when the models are misspecified. Furthermore, comparing the CSW and FSW subclassification estimators across all scenarios, both have about the same bias, but FSW has roughly twice the variance of CSW; consequently, for all scenarios and estimators, the subclassification estimators with CSW have smaller RMSE than those with FSW.

Concluding Remarks
In observational studies, the existence of confounding variables must be attended to, and propensity score weighting methods are often used to eliminate their effects. Propensity score methods have been extended to various settings, including multiple treatments and longitudinal or time-varying treatments. Several papers discuss the risk of misspecifying the propensity score model (e.g., Drake, 1993); in practice, however, insufficient attention has been paid to propensity score estimation and the instability of weighting estimators.
In this paper, we proposed a subclass construction algorithm, the causal subclassification tree, that satisfies the conditions given by Wang, Zhang, Richardson, and Zhou (2020) for the subclassification estimator to have √N-consistency and to be asymptotically well defined. We also showed that their suggested procedure does not satisfy condition (4). The advantage of the causal subclassification tree is that, unlike conventional methods that do not attend to the choice of cut-off points for the propensity score estimates, it chooses the cut-off points that minimize a given criterion. By defining a criterion that reflects what we regard as "good performance," we can control the performance of the causal estimator.
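To illustrate what "choosing a cut-off point that minimizes a criterion" means, the sketch below performs a single regression-tree-style split on estimated propensity scores, using total within-subclass sum of squares as the criterion. Both the criterion and the exhaustive-search split rule are illustrative stand-ins, not the paper's exact algorithm:

```python
def best_cutoff(scores):
    """Choose the cut-off on estimated propensity scores that minimizes the
    total within-subclass sum of squares (a regression-tree-style criterion;
    illustrative only). Returns (cutoff, criterion_value)."""
    def sse(xs):
        # within-group sum of squared deviations from the group mean
        if not xs:
            return 0.0
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs)

    s = sorted(scores)
    best_cut, best_crit = None, float("inf")
    for k in range(1, len(s)):
        cut = (s[k - 1] + s[k]) / 2          # candidate cut-off between neighbors
        crit = sse(s[:k]) + sse(s[k:])       # criterion for this two-subclass split
        if crit < best_crit:
            best_cut, best_crit = cut, crit
    return best_cut, best_crit
```

A full tree would apply such a split recursively, and other criteria (e.g., ones reflecting estimator variance) could be substituted for the sum of squares; that substitutability is exactly the flexibility the paragraph above describes.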
The simulation study showed that the proposed method improves the performance of propensity score weighting estimators, especially when the propensity score model is misspecified. When both models are misspecified, doubly robust estimators with our weights performed better than the original one, and doubly robust estimators with machine learning methods achieve smaller RMSE than those using ordinal propensity score estimation methods. That is, doubly robust estimators with machine learning methods combined with our algorithm have the smallest RMSE when both models are misspecified.
We also compared the empirical performance of the subclassification estimator based on our algorithm with the full subclassification estimator constructed by Wang et al.'s (2020) suggested procedure. The simulation results show that in terms of estimation bias there is little difference between the two estimators, but in terms of variance, the full subclassification estimator has roughly twice the variance of the estimator based on the causal subclassification tree.
Lastly, we mention future work. When subclassification methods are used, the following two residual effects on causal estimators within each subclass should be considered. One is the effect of residual within-subclass confounding (e.g., Lunceford