Dynamic Attribute-Level Best Worst Discrete Choice Experiments

Dynamic modelling of decision makers' best and worst choice behavior in discrete choice experiments (DCEs) has numerous applications. Such models are built on the decision maker's utility function and are used in many areas, including the social sciences, health economics, transportation research, and health systems research. After reviewing references on the study of such experiments, we present an example of a DCE with emphasis on time-dependent best-worst choice and discrimination between choice attributes. Numerical examples of the dynamic DCEs are simulated, and the associated expected utilities over time of the choice models are derived using Markov decision processes. The estimates are computationally consistent with decision choices over time.


Introduction and Motivation
Discrete choice experiments (DCEs) are applied in the social sciences, health economics, transportation research, and health systems (see Potoglou et al., 2011; Lancsar & Louviere, 2008; Greene & Hensher, 2003). DCEs and their models (discrete choice models, DCMs) focus on predicting a decision maker's choices among products or services. In many cases, these choices are time dependent. Such research has not yet been practically implemented in attribute-level best-worst DCEs, in which the decision maker's task is to choose both the best and the worst option from a choice set, instead of only the best option as in traditional DCEs. This class of experiments falls under best-worst scaling (BWS) experiments (Louviere et al., 2015). Here, we apply the BWS models over a time sequence to quantify and measure decision maker behavior, and we derive the utilities using Markov decision processes (MDPs). The change in utilities from one time to the next is described as a gain or loss. The utility is composed of a systematic component that depends on the key attributes of the product and a random component. Train (2009) presents multiple models based on different assumptions about the distribution of the random component. In his suggested model, the error terms are assumed to be homogeneous and uncorrelated (Train, 2009). By assuming the covariates are generated under a normal distribution and the error terms under a generalized extreme value distribution, the output data are then modeled as binary and conditional logit. Our focus is on the conditional logit assumption, but we add a dependence structure through time and transition probabilities under MDPs.

Attribute-Level Best-Worst Designs
In a traditional DCE, we have a sample of n decision makers with J alternative choices. The utility function for the ith individual selecting the jth choice is given as
\[ U_{ij} = V_{ij} + \epsilon_{ij}, \quad (1) \]
in which V_ij is the systematic component and ε_ij is the unobserved component, or error term, where i = 1, …, n and j = 1, …, J.
In McFadden (1974), a common distribution for the error terms was proposed: the Type I extreme value, or Gumbel, distribution. That assumption leads to the conditional logit for modelling the data. Train (2009) presented other models and their associated assumptions in modelling the choices made by decision makers. As stated in Train (2009), the most important criterion is not so much the shape of the error terms as whether the errors, and possibly the utilities, are allowed to be correlated. To allow for dependence in choices, the error terms may be distributed as normal, and that assumption allows the outcomes to be modelled under the probit or the generalized extreme value distribution.
The behavior is modelled through the conditional logit, and the utilities associated with the various products are then estimated, with the error term of the utility following the Type I extreme value distribution.
The systematic component is given as
\[ V_{ij} = x_{ij}' \beta, \]
with x_ij describing the ith subject's covariates on the jth alternative, and β defined as the subject-specific covariate estimates.
The utility is then given as in Equation (1). Hence, the probability of the jth alternative being chosen by the ith subject is
\[ P_{ij} = \frac{\exp(V_{ij})}{\sum_{k \in C} \exp(V_{ik})}, \quad (2) \]
with ∑_{j ∈ C} P_ij = 1 and C the set of all possible choices.
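As a minimal sketch of Equation (2), the conditional logit probabilities can be computed from the systematic utilities with a numerically stable softmax (the utility values below are illustrative, not taken from the text):

```python
import numpy as np

def conditional_logit_probs(V):
    """Conditional logit probabilities P_ij = exp(V_ij) / sum_k exp(V_ik)."""
    V = np.asarray(V, dtype=float)
    expV = np.exp(V - V.max())   # subtract the max for numerical stability
    return expV / expV.sum()

# Illustrative systematic utilities for J = 3 alternatives
probs = conditional_logit_probs([1.0, 0.5, -0.2])
```

The probabilities sum to one over the choice set, and the alternative with the largest systematic utility receives the largest probability.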
The above can be seen as a special approach at the intersection of information theory (e.g., the entropy function) and the multinomial logit (Anas, 1983). Such a model can be enhanced by adding attributes associated with the alternatives.

Attribute-Level Best-Worst DCE
Attribute-level best-worst scaling (BWS) experiments are modified DCEs designed to elicit the impact that attributes and attribute-levels have on the utility of a product. As noted by Louviere and Timmermans (1990), an experiment must be designed so as to evaluate combinations of attribute-levels and obtain information about attribute impacts on utility. Attribute-level best-worst DCEs provide such an experimental design.
Following the setup described by Street and Knox (2012), there are K attributes that describe the products, denoted A_k, with each attribute consisting of l_k levels for k = 1, …, K. In the studies by Knox et al. (2012) and Knox et al. (2013) on contraceptive data, there were K = 7 attributes, with attribute-levels l_1 = 8, l_2 = 3, l_3 = 4, l_4 = 4, l_5 = 8, l_6 = 9, and l_7 = 6. One of the attributes is the contraceptive's effect on acne, and the levels associated with that attribute are no effect, improves, or worsens acne symptoms. Each product is represented by a profile x = (x_1, …, x_K), in which x_k is the attribute-level of A_k that makes up the product, where the attribute-levels take values from 1 to l_k for k = 1, …, K. The choice task considered here is to look at pairs of attribute-levels and build a utility function over time. For every profile, the choice set (pairs of attribute-levels) is then given as
\[ C_x = \{(A_j x_j, A_{j'} x_{j'}) : j \neq j'\}, \]
where the first attribute-level in a pair is considered the best and the second the worst. From the profile x, the decision maker evaluates the choice set and determines, from the τ = K(K − 1) choices given, which pair is the best-worst pair. The systematic component of attribute-level A_j x_j is
\[ v(A_j x_j) = \beta_{A_j} + \beta_{A_j x_j}. \quad (3) \]
Under the conditional logit, with the choice of the scale function b(A_j x_j) = exp(β_{A_j} + β_{A_j x_j}) = exp(v(A_j x_j)), the probability in Equation (2) that the pair (A_j x_j, A_{j'} x_{j'}), also denoted s_i, is chosen becomes
\[ P(A_j x_j, A_{j'} x_{j'}) = \frac{b(A_j x_j)/b(A_{j'} x_{j'})}{\sum_{(A_m x_m, A_{m'} x_{m'}) \in C_x} b(A_m x_m)/b(A_{m'} x_{m'})}. \quad (4) \]
We assume the error terms come from a Type I extreme value distribution and use the conditional logit to estimate the parameter vector
\[ \beta = (\beta_{A_1}, \ldots, \beta_{A_K}, \beta_{A_1 1}, \ldots, \beta_{A_K l_K})'. \]
To estimate these parameters, the following identifiability condition defined on the parameters of the attribute-levels must be met:
\[ \sum_{x_k = 1}^{l_k} \beta_{A_k x_k} = 0 \quad \text{for all } k = 1, 2, \ldots, K \quad (5) \]
(Street & Burgess, 2007; Flynn et al., 2008; Grasshoff et al., 2003).
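A sketch of Equation (4) under illustrative coefficients (the β values below are hypothetical, not estimates from any study): for a single profile, each ordered (best, worst) pair of attribute-levels is scored by the ratio of scale functions b(best)/b(worst) and normalized over the τ = K(K − 1) pairs.

```python
import itertools
import math

# Hypothetical coefficients for K = 3 attributes (not estimated values)
beta_attr = {1: 0.4, 2: -0.1, 3: 0.2}
beta_level = {(1, 1): 0.3, (2, 2): -0.2, (3, 1): 0.1}  # (attribute, level) -> beta

def scale(attr, level):
    """Scale function b(A_k x_k) = exp(beta_Ak + beta_Ak_xk)."""
    return math.exp(beta_attr[attr] + beta_level.get((attr, level), 0.0))

def pair_probs(profile):
    """Conditional logit probabilities of each ordered (best, worst) pair
    of attribute-levels shown in a profile, as in Equation (4)."""
    shown = list(profile.items())                   # one level per attribute
    pairs = list(itertools.permutations(shown, 2))  # tau = K(K-1) ordered pairs
    weights = [scale(*best) / scale(*worst) for best, worst in pairs]
    total = sum(weights)
    return {pair: w / total for pair, w in zip(pairs, weights)}

probs = pair_probs({1: 1, 2: 2, 3: 1})  # profile x = (1, 2, 1)
```

With K = 3 attributes the profile yields six ordered pairs, and their probabilities sum to one.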
The log-likelihood for estimating the model parameters based on a random sample of n decision makers is given as
\[ \ell(\beta) = \sum_{s=1}^{n} \sum_{i=1}^{G} \sum_{j=1}^{\tau} y_{sij} \log P_{ij}, \]
in which the response variables representing the choices within each of the choice sets for the experiment are denoted as
\[ y_{sij} = \begin{cases} 1 & \text{if decision maker } s \text{ chooses pair } j \text{ in choice set } i, \\ 0 & \text{otherwise}, \end{cases} \]
for i = 1, 2, …, G, s = 1, 2, …, n, and j = 1, 2, …, τ. Lancsar et al. (2017) suggested connecting models and their parameters in estimation analyses and producing measures that are related to policy and practice. We include the time feature in the Case 2 BWS model structure.
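The log-likelihood can be sketched as follows, with hypothetical design arrays for the best and worst regression vectors f and g stacked per choice set (all names, dimensions, and data below are illustrative):

```python
import numpy as np

def bw_log_likelihood(beta, F, Gm, chosen):
    """Conditional logit log-likelihood for best-worst pair choices.
    F, Gm: (G, tau, p) arrays of best/worst regression vectors f and g;
    chosen: (n, G) index of the pair picked in each choice set."""
    v = (F - Gm) @ beta                               # (G, tau) pair utilities
    logP = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    G = logP.shape[0]
    return logP[np.arange(G)[None, :], chosen].sum()  # sum over s and i

# Tiny illustrative example: G = 2 sets, tau = 2 pairs, p = 2 parameters
rng = np.random.default_rng(0)
F, Gm = rng.normal(size=(2, 2, 2)), rng.normal(size=(2, 2, 2))
ll = bw_log_likelihood(np.array([0.5, -0.3]), F, Gm,
                       chosen=np.array([[0, 1], [1, 0]]))
```

In practice this function would be passed to a numerical optimizer to obtain the maximum-likelihood estimates subject to the identifiability condition in Equation (5).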

Functional Form of Attribute-Level Best-Worst Discrete Choice Model
Van Der Pol et al. (2014) presented the systematic components of the utility defined as linear functions, quadratic functions, or as stepwise functions of the attributes. Grasshoff et al. (2013) defined the functions as regression functions of the attributes and attribute-levels in the model.
In the attribute-level best-worst DCEs, the utility of a pair is composed of the utility corresponding to the best attribute-level and the worst attribute-level. The regression functions presented in Grasshoff et al. (2003) are applied to the attributes and attribute-levels within the respective systematic components. Let f be the set of regression functions for the best attribute-levels in the pairs and g the set of regression functions for the worst attribute-levels in the pairs. The p × 1 parameter vector β still must satisfy the identifiability condition given in Equation (5).
Taking the systematic component defined in Equation (3), the functional systematic component for the pair (A_j x_j, A_{j'} x_{j'}) is defined as
\[ v_i(A_j x_j, A_{j'} x_{j'}) = \left[ f(A_j x_j) - g(A_{j'} x_{j'}) \right]' \beta, \]
in which j, j' = 1, 2, …, K, j ≠ j', and i = 1, 2, …, G. The probability that an alternative is chosen depends on the definition of the utility and the distribution of the error terms. Referring back to Equation (4) under the conditional logit, the probability is
\[ P(A_j x_j, A_{j'} x_{j'}) = \frac{\exp\{v_i(A_j x_j, A_{j'} x_{j'})\}}{\sum_{(A_m x_m, A_{m'} x_{m'}) \in C_i} \exp\{v_i(A_m x_m, A_{m'} x_{m'})\}}, \]
in which i = 1, 2, …, G, j, j' = 1, 2, …, K, and j ≠ j'.
In the traditional attribute-level best-worst DCE, the regression functions f and g are defined as indicator functions. The indicator functions are p × 1 vectors. For the attributes, they are defined as
\[ f_{A_k}(A_j x_j) = \mathbf{1}\{k = j\}, \qquad g_{A_k}(A_{j'} x_{j'}) = \mathbf{1}\{k = j'\}, \]
and for the attribute-levels as
\[ f_{A_k x_k}(A_j x_j) = \mathbf{1}\{k = j, x_k = x_j\}, \qquad g_{A_k x_k}(A_{j'} x_{j'}) = \mathbf{1}\{k = j', x_k = x_{j'}\}, \]
in which j, k = 1, 2, …, K and i = 1, 2, …, G.
By rescaling the indicator functions of the attributes A_k and attribute-levels A_k x_k, a more general form of the regression functions can be defined. Let b_{A_k} and b_{A_k x_k} be constants corresponding to the best attribute and attribute-levels in a pair, and w_{A_k} and w_{A_k x_k} be constants corresponding to the worst attribute and attribute-levels in a pair, in which x_k = 1, 2, …, l_k and k = 1, 2, …, K. The regression functions f and g are given as
\[ f_{A_k}(A_j x_j) = b_{A_k} \mathbf{1}\{k = j\}, \qquad f_{A_k x_k}(A_j x_j) = b_{A_k x_k} \mathbf{1}\{k = j, x_k = x_j\}, \quad (10) \]
and
\[ g_{A_k}(A_{j'} x_{j'}) = w_{A_k} \mathbf{1}\{k = j'\}, \qquad g_{A_k x_k}(A_{j'} x_{j'}) = w_{A_k x_k} \mathbf{1}\{k = j', x_k = x_{j'}\}, \quad (11) \]
in which j, j' = 1, 2, …, K, j ≠ j', and i = 1, 2, …, G.
The regression functions defined in this way provide flexibility for the traditional attribute-level best-worst DCEs. Since consumer preferences for products are constantly re-evaluated, the data collected on a product may be dynamic. The addition of these constants to the regression functions gives researchers the ability to scale the data to reflect current trends or changes in the products. For example, suppose the products being modeled are pharmaceuticals, such as the contraceptives considered in Knox et al. (2012) and Knox et al. (2013). If new information were discovered that a brand of contraceptives poses a health risk, then using the regression functions it is possible to update the model to reflect this change. Assuming the change is to remove the brand, the attribute-level associated with the brand may have b_{A_k x_k} = w_{A_k x_k} = 0 for the corresponding attribute k and level x_k, representing its removal from the market. For all the pairs this attribute-level was in, the information the choice pair provides in terms of the other attributes and attribute-levels would remain intact. The model would be estimated again, and the parameter vector β would provide the updated impact of the attributes and attribute-levels in the experiment.
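A sketch of this updating step, with an illustrative p-vector of estimates and constants (the index of the withdrawn level and all values are hypothetical): zeroing the best-side and worst-side constants of one attribute-level drops its contribution from every pair it appears in, while the other entries are untouched.

```python
import numpy as np

p = 12                                  # parameters: K + sum of l_k
beta_hat = np.linspace(-0.5, 0.6, p)    # illustrative estimates
b = np.ones(p)                          # best-side constants b_Ak, b_Ak_xk
w = np.ones(p)                          # worst-side constants w_Ak, w_Ak_xk

removed = 7                             # hypothetical index of the withdrawn level
b[removed] = w[removed] = 0.0           # remove it from both sides

def pair_utility(f_ind, g_ind):
    """Systematic component of a pair under the scaled regression functions."""
    return (b * f_ind) @ beta_hat - (w * g_ind) @ beta_hat
```

Any indicator vector that points only at the removed attribute-level now contributes zero utility, while pairs built from the remaining attributes and attribute-levels are unchanged.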

Time Dependent Modelling Under Markov Decision Processes
Markov decision processes (MDPs) are sequential decision-making processes. MDPs seek to determine the policy, or set of decision rules, under which maximum reward over time is obtained. MDPs are defined by the set (S, R, D), in which S is the finite set of states, R the set of rewards, and D the set of decisions. These processes may be discrete or continuous in time, with a finite or infinite horizon. Our interest is in discrete time finite horizon MDPs, in which T is a fixed number of time periods. The rewards (or expected rewards) are maximized by the best sequential decisions over time, making MDPs a dynamic optimization tool, as used in Blanchet et al. (2016) to identify the right choices of substitution behaviors of the decision makers.

Let s_t ∈ S be the state occupied at time t, r_t(s_t) be the reward associated with s_t, and d_t(s_t, r_t) be the decision based on the possible rewards and states at time t. The decision process maps the movement from one state to another over time, based on the rewards received and on an optimal decision set. As the decision process is Markovian, the transition probability to the next state is based solely on the decision made at the current state and is p(s_{t+1} | s_t, d_t), in which t = 0, 1, …, T (Puterman, 2014). There is a decision rule that governs the action the decision maker takes and the reward that results from the action. Decisions are made so as to maximize rewards. Rust (1994) and Arcidiacono and Ellickson (2011) applied MDPs to DCMs. Chades et al. (2014) applied them to solve problems in an ecological setting; as they noted, suggesting guidance requires running several cases. To our knowledge, such a technique has not yet been applied to decision maker choice experiments with attribute and attribute-level best-worst designs.
For DCMs, the reward is defined by the utility function, r_t(s_t, d_t) = U_t(s_t, d_t), in which d_t = δ_t(s_t) is the decision rule at time t that maximizes the utility, and the optimal decision rule is the one that maximizes the expected discounted utility, given as the value function.
The value function for DCMs comes from Bellman's equation and is given as
\[ V_t(s_t) = \max_{d_t \in D} \left[ U_t(s_t, d_t) + \gamma \sum_{s_{t+1} \in S} p(s_{t+1} \mid s_t, d_t)\, V_{t+1}(s_{t+1}) \right], \]
in which the discount utility rate is given by γ ∈ (0, 1). The steps for determining the value function follow.
The decision rule used by a decision maker is the one under which the utility is maximized, under the assumption that a person's perceived utility is impacted by time. Frederick et al. (2002) reviewed the work done on discounted utility, including decision makers' discounting of time. The discount utility rate weights the utility a person gains from an option at some later time based on their current state at time t, and it guarantees convergence of the infinite sum of rewards.
MDPs model the sequence of decisions based on expected rewards and transition probabilities. We define a state transition as the movement from one choice pair to another between consecutive time points. Since no closed-form expression for this dynamic optimization problem is available, the value functions are computed recursively via dynamic programming under a backwards recursion algorithm. We denote p(s_{t+1} | s_t, d_t) as the transition probability to the next state given the decision made at the current state, for t = 1, …, T. First, we compute
\[ V_T(s_T) = \max_{d_T \in D} U_T(s_T, d_T). \]
Next, we move one time step back and compute
\[ V_{T-1}(s_{T-1}) = \max_{d_{T-1} \in D} \left[ U_{T-1}(s_{T-1}, d_{T-1}) + \gamma \sum_{s_T \in S} p(s_T \mid s_{T-1}, d_{T-1})\, V_T(s_T) \right], \]
and so on. Following this pattern, we get
\[ V_t(s_t) = \max_{d_t \in D} \left[ U_t(s_t, d_t) + \gamma \sum_{s_{t+1} \in S} p(s_{t+1} \mid s_t, d_t)\, V_{t+1}(s_{t+1}) \right], \quad s_t \in S, \]
for t = 1, …, T − 1. For these experiments, we considered discrete time finite horizon MDPs where: • Choice sets C_t are modeled across time of length T.
• States s_t are the attributes and attribute-levels corresponding to the choices in C_t, for t = 1, …, T.
• The decision set depends on the choice set evaluated, d_t ∈ D, in which t = 1, …, T and i = 1, …, G.
• Transition probabilities depend on a set of parameters θ that are assumed to be known or estimable from data. θ is a function of an attribute and attribute-level not necessarily identical to β, as described in Arcidiacono and Ellickson (2011).
• Transition probability matrices are dependent on time and on the choice set being evaluated. There are G choice sets with τ = K(K − 1) choice pairs in each set. To compute the transition probabilities, the parameters are assumed known (Arcidiacono & Ellickson, 2011). Let θ_i be the set of parameters guiding the transitions from s_i, for i = 1, …, G. In Case 2 BWS models, these parameters would be the measures of relative impact/preference associated with the attributes and attribute-levels corresponding to the different choice pairs, or states, given the current state is s_i, in which i = 1, …, G. Rust (2008) and Arcidiacono and Ellickson (2011) stated that θ is assumed known under some rationale with regard to decision maker behavior or preferences.
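The backwards recursion above can be sketched with dynamic programming over arrays (the dimensions and inputs are illustrative; U holds the utilities U_t(s, d) and P the transition matrices):

```python
import numpy as np

def backward_recursion(U, P, gamma):
    """Finite-horizon value functions and policy by backwards recursion.
    U: (T, S, D) utilities U_t(s, d); P: (D, S, S) transition probabilities
    with P[d, s, s2] = p(s2 | s, d); gamma: discount rate in (0, 1)."""
    T, S, D = U.shape
    V = np.zeros((T + 1, S))                  # boundary condition: V after T is 0
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):            # move backwards in time
        # Q[s, d] = U_t(s, d) + gamma * sum_s2 p(s2 | s, d) * V_{t+1}(s2)
        Q = U[t] + gamma * np.einsum('dij,j->id', P, V[t + 1])
        V[t] = Q.max(axis=1)                  # Bellman maximization over decisions
        policy[t] = Q.argmax(axis=1)
    return V[:-1], policy
```

The returned policy gives, for every time point and state, the decision that attains the maximum in Bellman's equation.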
The parameter estimates determined by fitting the conditional logit model, as described in Section 2, produce a vector of length p = K + ∑ l_k. These parameter estimates measure the relative impact of each attribute and attribute-level on the decisions made by the decision makers. The transition parameters a_{s_i m}(t) are the assumed impacts of the attributes and attribute-levels on the decision maker's decisions, given they currently occupy state s_i. We define these parameters as functions of the parameter estimates β_m, in which there is a rate of change in the impacts over time, as follows:
\[ a_{s_i m}(t) = \alpha_{s_i m}(t)\, \beta_m, \quad m = 1, \ldots, p. \]
These α_{s_i m}(t) are rates of change that guide the dynamic transition of the decision process. We can easily consider them to be non-time dependent, α_{s_i m}(t) = α_{s_i m}, defining the transition probabilities as stationary over time. As mentioned earlier, there are infinite possibilities in how we define the transitions. Rust (2008) stated that when using rational observation to define the transitions, many possible choice behaviors by the decision makers are possible. Chades et al. (2014) recommended running many cases to determine the transition probabilities that will maximize the expected reward. Our definition also offers infinite possibilities in terms of how the rates of change are specified.
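One way to turn transition parameters into transition probabilities can be sketched as follows (an assumption for illustration: each row of scores, built from the state-specific parameters, is mapped to probabilities with a conditional-logit/softmax form, one of the many possible definitions noted above):

```python
import numpy as np

def transition_matrix(scores):
    """Row-stochastic transition matrix from a (G, G) array of scores,
    where scores[i, j] reflects the transition parameters of state s_i
    evaluated at candidate state s_j."""
    z = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized softmax
    return z / z.sum(axis=1, keepdims=True)

# Illustrative 3-state example
Pt = transition_matrix(np.array([[1.0, 0.2, -0.5],
                                 [0.0, 0.8, 0.1],
                                 [-0.3, 0.4, 1.2]]))
```

Each row of the resulting matrix sums to one, so it can be used directly as p(s_{t+1} | s_t) in the backwards recursion.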

Data Simulation
In the simulated example, an empirical setup is considered. We assume K = 3 attributes with l_1 = 2, l_2 = 3, and l_3 = 4 attribute-levels in an unbalanced design. There are 2 × 3 × 4 = 24 possible profiles, or products, in this experiment. The total number of attribute-levels is L = ∑ l_k = 9, and the total number of choice pairs is
\[ G = \sum_{k=1}^{K} l_k (L - l_k) = 52. \]
Louviere and Woodworth (1983), Street and Knox (2012), and Grasshoff et al. (2004) discussed the benefits of using orthogonal arrays. Generally, orthogonal experimental designs are utilized in attribute-level best-worst DCEs due to the large number of profiles in a full factorial design. The R package DoE.base creates full factorial and orthogonal designs for a given set of attributes and attribute-levels; to obtain an orthogonal design, the oa.design function is used. For this experiment, the orthogonal design returned the full factorial design, so we used the full set of 24 profiles when simulating the data.
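The design counts above can be verified directly (a sketch in Python rather than the R workflow described in the text):

```python
import itertools

levels = {1: 2, 2: 3, 3: 4}   # l_1 = 2, l_2 = 3, l_3 = 4

# Full factorial design: every combination of attribute-levels
profiles = list(itertools.product(*(range(1, l + 1) for l in levels.values())))

L = sum(levels.values())                        # total attribute-levels
G = sum(l * (L - l) for l in levels.values())   # ordered cross-attribute pairs

print(len(profiles), L, G)
```

The enumeration reproduces the 24 profiles, L = 9 attribute-levels, and G = 52 choice pairs stated above.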
We simulated data for n = 300 respondents over the 24 profiles. Each choice set has τ = K(K − 1) = 6 choices to choose from. Using the parameters given in Table 1, we simulated data in R. The data were then exported from R into the SAS environment. Using the SAS procedure MDC (multinomial discrete choice), the conditional logit model was fitted to the data. The parameter estimates for the generated data are given in Table 1.
The parameter estimates are close to the original parameters for this example. Using the parameter estimates, the choice utilities were computed and used to determine the expected utility/value function. The best and worst 3 choice pairs, along with their utilities, are presented in Table 2 and Table 3, respectively. As expected, the reverses of the pairs with the highest utilities have the lowest utilities. We also consider an example where the model is built on the regression functions f and g of the data, defined as given in Equations (10) and (11). With the chosen weights applied in the regression functions, the conditional logit model is fit to the data, and the resulting parameter estimates are given in Table 1. The parameter estimates provide the adjusted attribute and attribute-level impacts. To build preference choices over time, we next extend the Case 2 BWS experiment of choice pairs to describe the optimal variation over T = 5 time periods. Under that experiment, the decision maker chooses the alternative that provides maximum utility of attributes and attribute-levels over time. Numerical maximization to find the expected utility under Bellman's equation of the MDPs is used under two cases. Under Case 1, stationary transition probabilities are considered, while dynamic transition probabilities are presented under Case 2.

Case 1: Stationary Transition Probabilities
We ran the simulation under this case with the proposed structure. The intent is to validate the relative performance of the choices over time under stationary transition probabilities.
In this example, respondents are assumed to make decisions at each decision epoch similar to those they made at the previous time point. The transition parameters a_{s_i m}(t), where s_i = (A_j x_j, A_{j'} x_{j'}), are defined for the attributes and attribute-levels as
\[ a_{s_i m}(t) = \begin{cases} 1.7\,|\beta_m| & \text{if parameter } m \text{ corresponds to the best attribute or attribute-level of } s_i, \\ -1.7\,|\beta_m| & \text{if parameter } m \text{ corresponds to the worst attribute or attribute-level of } s_i, \\ \beta_m & \text{otherwise}, \end{cases} \]
where j ≠ j', j, j', k = 1, 2, …, K, 1 ≤ x_k ≤ l_k, and i = 1, 2, …, G. The transition parameters do not change with time, so the transition matrix is stationary. The goal of this case was to design the transition probabilities so that the choice made at t is most likely to be made again at t + 1. If we considered a_{s_i m}(t) = β_m for i = 1, 2, …, G and m = 1, 2, …, p, then the system would remain static and every row of the transition matrix would be the same. Recall that p = K + ∑ l_k = 12 is the number of parameters. We use 1.7|β_m| when a state, or choice pair, at time t + 1 has the same best attribute and attribute-level as the state occupied at time t, and −1.7|β_m| when a state at time t + 1 has the same worst attribute and attribute-level as the state occupied at time t. The absolute value |β_m| controls the direction of the impact, making sure it is positive for the best attribute and attribute-level of s_i and negative for the worst attribute and attribute-level of s_i, and the factor 1.7 increases the impact of the best and worst attributes and attribute-levels of s_i. Defining a_{s_i m}(t) in this way ensures that states with the same best and worst attributes and attribute-levels as the present state occupied, s_i = (A_j x_j, A_{j'} x_{j'}), have a greater probability of being transitioned to, where i = 1, 2, …, G, j ≠ j', j, j' = 1, 2, …, K, and t = 1, 2, …, T. The weights associated with the attributes and attribute-levels are selected as above.
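Case 1's construction can be sketched as follows (the masks and β values are illustrative; in practice the masks come from comparing each candidate state's best and worst attributes and attribute-levels with those of s_i):

```python
import numpy as np

def stationary_params(beta, same_best, same_worst):
    """Stationary transition parameters a_{s_i m} for one current state s_i.
    beta: (p,) estimates; same_best / same_worst: boolean (p,) masks flagging
    parameters shared with the best / worst side of s_i."""
    a = np.asarray(beta, dtype=float).copy()
    a[same_best] = 1.7 * np.abs(a[same_best])     # favor states sharing the best
    a[same_worst] = -1.7 * np.abs(a[same_worst])  # penalize sharing the worst
    return a

# Illustrative estimates and masks for p = 4 parameters
beta = np.array([0.4, -0.3, 0.2, -0.1])
a = stationary_params(beta,
                      same_best=np.array([True, False, False, False]),
                      same_worst=np.array([False, True, False, False]))
```

Because the masks and β do not depend on t, the resulting transition matrix is the same at every decision epoch, which is exactly the stationary setting of this case.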
Referring back to Section 3, the systematic component, as a function of the best and worst attribute-levels in the pair, is
\[ v_{it}(A_j x_j, A_{j'} x_{j'}) = \left[ f_t(A_j x_j) - g_t(A_{j'} x_{j'}) \right]' \beta, \]
where f_t and g_t are the regression functions of Equations (10) and (11) with the constants allowed to vary with time, and where j, j' = 1, 2, …, K, j ≠ j', and i = 1, 2, …, G.