Model Selection for Poisson Regression via Association Rules Analysis

This study integrates association rules analysis, a methodology for selecting potential interactions, with Poisson regression modeling. Though typically ignored in conventional Poisson regression, interactions are very common in practice. However, selecting a Poisson regression model when many main effects and interactions are involved is problematic. In this study, we develop a model selection framework to address this problem. Specifically, we focus on building an optimal Poisson regression model by (1) discretizing the response and quantitative attributes into levels; (2) exploring, via association rules analysis, combinations of input variables that have a significant impact on the response; (3) selecting potential (low- and high-order) interactions; (4) converting these potential interactions into new variables; and (5) selecting variables from all the input variables and the newly created variables (interactions) to build the optimal Poisson regression model. Our model selection procedure is the first approach to enable a global search for potential interactions and the first to establish the optimal combination of main effects and interaction effects in the Poisson regression model. A real-life example is given for illustration. It is shown that the proposed method finds the optimal model including important interactions that cannot be found by other existing methods.


Introduction
The Poisson regression model is one of the most important models for count data. It has been used to explore relationships among variables in a wide variety of areas, including economics, health care, demography, business, and manufacturing. However, selecting a Poisson regression model when many main effects and interactions are involved is complicated. Interactions are very common in practice. Yet, in conventional Poisson regression, they are typically ignored.
In this paper, we propose a systematic procedure designed to consider the interactions among variables and thereby produce models that are superior to those developed using established methods. We consider both low-order and high-order interactions among variables. Note that we also consider interactions among categorical variables, which are usually neglected in the literature. However, it is both impractical and ineffective to include all the possible interactions in a model selection process. Instead, we apply association rules analysis, a data-mining technique, to find potential interactions for the model selection procedure.
Association rules analysis is a methodology that aids in the selection of potential interactions among categorical variables from a large pool of possibilities. Using this method, we are able to narrow the field of possible combinations by eliminating interactions that are unlikely to contribute to the fit of the Poisson model.
Our model selection framework extends classical model building in its ability to consider potential interactions. The framework uses association rules analysis to streamline the selection of important rules, converts the selected rules into interaction variables, and determines the optimal model for Poisson regression by implementing a subset selection method that considers all the main effects and the potential interactions. The key advantages of the proposed framework include its ability to deal with a large number of interactions, its ability to select potential interactions, and its provision of alternative setups for interactions. In the proposed method, interactions are incorporated into the Poisson regression model. Higher-order interactions can also be included in the model, a feature that differentiates our model selection approach from other Poisson regression modeling approaches.
This paper is organized as follows. Section 2 reviews the relevant academic literature pertaining to Poisson regression modeling and association rules analysis. Section 3 presents the framework and discusses the proposed method in detail. Section 4 presents an application of our framework to a real dataset. Section 5 offers a discussion and concluding remarks.
A Poisson regression model uses a log link to explain the relationship between the response, i.e., count data, and its explanatory variables (Agresti, 2002). The Poisson regression model with only main effects is written as:

$$\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

where μ is the expected value of the response, the X's are the main effects, and the β's are the coefficients of those main effects (with β_0 the intercept).
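To make the log link concrete, the following minimal sketch computes the expected count for a given set of main effects. The function name `poisson_mean` and the coefficient values are hypothetical and chosen only for illustration; they do not come from any fitted model in this paper.

```python
import math

def poisson_mean(beta0, betas, x):
    """Expected count under a log link: mu = exp(beta0 + sum_j beta_j * x_j)."""
    eta = beta0 + sum(b * v for b, v in zip(betas, x))
    return math.exp(eta)

# Hypothetical coefficients for two binary main effects (illustrative only).
beta0 = math.log(2.0)
betas = [math.log(3.0), 0.0]

mu_off = poisson_mean(beta0, betas, [0, 0])  # first main effect absent
mu_on = poisson_mean(beta0, betas, [1, 0])   # first main effect present

# Under the log link, switching X_1 from 0 to 1 multiplies the mean by exp(beta_1).
ratio = mu_on / mu_off
```

Because the model is linear on the log scale, each coefficient acts multiplicatively on the expected count, which is why the ratio above equals exp(β_1) regardless of the other variables.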
However, selecting a Poisson regression model when many main effects and interactions are involved is complicated. Interactions, especially those between categorical variables, can be addressed in different ways. In the literature on the categorical regression model, interactions among categorical variables are generally neglected. Given that the number of interactions increases at an accelerated rate as the main effects increase in number, the subset selection method is extremely inefficient when all the interactions are considered at the same time. Therefore, an efficient method for selecting potential interaction variables is required. Here, we develop a methodology for implementing association rules analysis for exactly this purpose. As will be seen, the proposed method is able to find the optimal model when the interaction effect is present.
Association rules analysis is a methodology for exploring relationships among items in the form of rules. Each rule has two parts: the first part pertains to the left-hand side item(s) or the condition(s), and the second part to the right-hand side item or the result. The rule is always represented as a statement: If condition, then result (Berry & Linoff, 1997). Two measurements are attached to each rule. The first measurement, support (s), is computed by s = Prob(condition and result). The second measurement, confidence (c), is computed by c = Prob(result | condition). Association rules analysis finds all the rules that satisfy both of these key thresholds: minimum support and minimum confidence (Agrawal & Srikant, 1994).
This set of rules can be used for other purposes, including classification. A technique called classification rule mining (CRM), a subset of association rules analysis, was developed to form an accurate classifier by finding a set of rules in a database (Liu, Hsu & Ma, 1998; Quinlan, 1992). This technique uses an item to represent a pair consisting of a main effect and its corresponding integer value. More specific than association rules analysis, CRM has only one target, which must be specified in advance. In general, the target of CRM is the response, which means the result of the rule (the right-hand item) can only be the response and its class. Therefore, the left-hand item (the condition) consists of the explanatory variable and its level. For example, assume that there are k categorical factors, X_1, X_2, …, X_k, and a categorical response Y. Many rules can be generated by CRM. One such rule could be "If X_1 = 1, then Y = 1" with s = P(X_1 = 1 and Y = 1) and c = P(X_1 = 1 and Y = 1)/P(X_1 = 1). Another rule could be "If X_1 = 1, then Y = 0" with s = P(X_1 = 1 and Y = 0) and c = P(X_1 = 1 and Y = 0)/P(X_1 = 1).
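The support and confidence of a single class rule can be computed directly from their definitions. The sketch below is a minimal illustration on a made-up four-record dataset; the function name `support_confidence` and the record layout (one dictionary per observation) are assumptions made for this example.

```python
def support_confidence(records, condition, result):
    """Return (support, confidence) of the rule 'If condition, then result'.

    records   : list of dicts mapping variable name -> level
    condition : dict of left-hand side variable -> level
    result    : dict of right-hand side variable -> level
    """
    n = len(records)
    cond_hits = [r for r in records
                 if all(r[k] == v for k, v in condition.items())]
    both_hits = [r for r in cond_hits
                 if all(r[k] == v for k, v in result.items())]
    support = len(both_hits) / n                       # P(condition and result)
    confidence = (len(both_hits) / len(cond_hits)      # P(result | condition)
                  if cond_hits else 0.0)
    return support, confidence

# Made-up toy data: three records with X1 = 1, two of which have Y = 1.
toy = [
    {"X1": 1, "Y": 1},
    {"X1": 1, "Y": 1},
    {"X1": 1, "Y": 0},
    {"X1": 0, "Y": 0},
]
s, c = support_confidence(toy, {"X1": 1}, {"Y": 1})
```

For the rule "If X_1 = 1, then Y = 1" on this toy data, the support is 2/4 and the confidence is 2/3.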
In the present study, we use CRM to screen out insignificant or irrelevant interactions and consider only those that are potentially significant for the Poisson regression model. This methodology is a major aspect of the selection of variables in our process. To our knowledge, there are no studies linking association rules analysis and classification rule mining to Poisson regression modeling. The methods proposed by Changpetch and Lin (2013a, 2013b) are limited to the binary response and multinomial response cases. The Poisson regression model is important in its own right and differs from the logistic regression model and the multinomial logit model. For the Poisson regression model, the response can take a very large number of distinct values. Therefore, it is necessary to discretize the response before applying association rules analysis. This additional step is particularly important for the Poisson regression model. Moreover, we discretize the quantitative attributes in the same step so that our approach can also handle interactions between categorical and quantitative variables. Therefore, we add the extra step of discretization to the four-step approach proposed by Changpetch and Lin (2013a, 2013b).
Several studies use Poisson regression in conjunction with other data-mining techniques in support of other objectives. For example, Chaudhuri, Lo, Loh, and Yang (1995) used decision tree analysis to find subnodes for a decision tree as a basis for estimating the Poisson regression model for each subnode. Here, we focus on searching for potential interactions for the Poisson regression model. The decision tree is another data-mining technique that can be used to find interactions. However, the decision tree has the disadvantage of a hierarchical structure. Because association rules analysis enables a global search, more potential interactions can be located and thus considered than is possible with the decision tree structure.

The Proposed Method
The proposed framework for building a Poisson regression model for count data consists of five key steps. As shown in Figure 1, the five steps in our framework are as follows:
Step 1: Discretize the response and the quantitative variables into categorical variables.
Step 2: Generate the rules from association rules analysis.
Step 3: Select the rules based on confidence.
Step 4: Generate the variables for each rule selected in Step 3.
Step 5: From the variables generated in Step 4 and all the main effects, search for the optimal model.
Figure 1. Framework for building the proposed model.

Step 1: Discretization
Given that association rules analysis works with categorical variables, we first discretize the response and the quantitative attributes into categories. We recommend three to five levels, since at least three levels (low, medium, and high) are needed for the response and the quantitative variables. Based on our empirical studies, dividing into just two levels loses potentially significant interactions. Dividing into too many levels produces a large number of rules with small support, which creates a screening problem: it becomes difficult to find, among so many potential rules, those that contribute significantly to the model. The practitioner can divide the data into levels based on the distribution or on available criteria. When there is no existing reference, the simplest way to discretize the response is to place an equal share of the observations in each level. For example, if we discretize the response into four levels, each level will contain 25% of the observations. After dividing a quantitative variable into levels, we convert it into dummy variables, in the same way that we convert categorical variables into dummy variables.
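The equal-share discretization described above can be sketched as follows. This is a minimal illustration only: the function names `equal_frequency_levels` and `dummies` are hypothetical, the data are made up, and tied values are split by sort order, which is a simplifying assumption that a practitioner may wish to handle differently.

```python
def equal_frequency_levels(values, n_levels=4):
    """Assign each value a level in 1..n_levels with (nearly) equal counts.

    Ranks are cut into n_levels equal blocks; ties are broken by sort order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    levels = [0] * n
    for rank, i in enumerate(order):
        levels[i] = rank * n_levels // n + 1
    return levels

def dummies(levels, n_levels):
    """Convert levels into one 0/1 dummy variable per level."""
    return [[1 if lv == k else 0 for k in range(1, n_levels + 1)]
            for lv in levels]

# Made-up quantitative variable with eight observations.
expenditure = [10, 50, 20, 80, 30, 70, 40, 60]
levels = equal_frequency_levels(expenditure, n_levels=4)
dummy_rows = dummies(levels, n_levels=4)
```

With four levels and eight observations, each level receives two observations, i.e. 25% of the data, matching the example in the text.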

Step 2: Association Rules Analysis
First, we use association rules analysis to create rules from the dataset. Specifically, we perform CRM. For each rule, the condition (left-hand items) represents the combination of explanatory variables and their levels, whereas the result (the right-hand item) is the response and its class. To perform CRM, we use the CBA program (Liu, Hsu & Ma, 1998) developed by the Department of Information Systems and Computer Sciences at the National University of Singapore (website: http://www.comp.nus.edu.sg/~dm2/). With this program, we are able to obtain all the active rules using the given minimum support and minimum confidence. For Poisson regression, we do not restrict the level of minimum support but recommend a minimum confidence level of 80%. The expected results from this step are the rules in the form "If X_i's = x_i's, then Y = y," where x_i is the level of variable X_i and where y is the level of response Y. With each rule, the respective support and confidence are attached. All active rules become inputs for the third step.
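For readers without access to CBA, the idea of this step can be illustrated by a brute-force enumeration of class rules. The sketch below is not the CBA algorithm and does not reproduce its interface; it simply lists every combination of up to `max_len` predictors whose confidence meets the threshold. The function name `mine_class_rules`, the `max_len` cap, and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def mine_class_rules(records, predictors, response, min_conf=0.8, max_len=3):
    """Enumerate rules 'If X's = x's, then Y = y' with confidence >= min_conf.

    No minimum support is imposed, in line with the recommendation in the text.
    """
    n = len(records)
    rules = []
    for size in range(1, max_len + 1):
        for variables in combinations(predictors, size):
            # Count response levels within each observed combination of levels.
            counts = {}
            for r in records:
                key = tuple(r[v] for v in variables)
                by_y = counts.setdefault(key, {})
                by_y[r[response]] = by_y.get(r[response], 0) + 1
            for key, by_y in counts.items():
                total = sum(by_y.values())
                for y, c in by_y.items():
                    if c / total >= min_conf:
                        rules.append({
                            "condition": dict(zip(variables, key)),
                            "result": y,
                            "support": c / n,
                            "confidence": c / total,
                        })
    return rules

# Made-up data: the response is 'high' only when X1 = 1 and X2 = 1.
toy = [
    {"X1": 1, "X2": 1, "Y": "high"},
    {"X1": 1, "X2": 1, "Y": "high"},
    {"X1": 1, "X2": 0, "Y": "low"},
    {"X1": 0, "X2": 1, "Y": "low"},
    {"X1": 0, "X2": 0, "Y": "low"},
]
active_rules = mine_class_rules(toy, ["X1", "X2"], "Y", min_conf=0.8, max_len=2)
```

Exhaustive enumeration grows quickly with the number of predictors, which is why a dedicated rule-mining program is used in practice.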

Step 3: Rule Selection
In this step, we select the rules to convert into interaction variables for the next step. The rule selection criterion used here is confidence. Therefore, rules with the highest level of confidence are selected from the active rules obtained in the second step. We call the rules selected at this stage potential rules. Note that the number of rules selected at this stage is relatively small compared to the total number of possible interactions for the dataset. In this work, we set the number of potential rules at between 30 and 50 (the same number recommended in Changpetch and Lin (2013a, 2013b)). The higher the number of variables, the higher the number of potential rules we select. All the potential rules are inputs for the fourth step.
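Selecting the potential rules amounts to ranking the active rules by confidence and keeping the top few. The sketch below assumes each rule is a dictionary with `confidence` and `support` entries; breaking ties in confidence by higher support is an assumption of this example, since the text specifies confidence as the only criterion.

```python
def select_potential_rules(rules, n_rules=50):
    """Keep the n_rules rules with the highest confidence.

    Ties in confidence are broken by higher support (an illustrative choice).
    """
    ranked = sorted(rules, key=lambda r: (-r["confidence"], -r["support"]))
    return ranked[:n_rules]

# Made-up active rules, identified by a label for readability.
active = [
    {"label": "A", "confidence": 0.90, "support": 0.30},
    {"label": "B", "confidence": 1.00, "support": 0.05},
    {"label": "C", "confidence": 0.85, "support": 0.40},
    {"label": "D", "confidence": 1.00, "support": 0.10},
]
potential = select_potential_rules(active, n_rules=2)
selected_labels = [r["label"] for r in potential]
```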

Step 4: Variable Generation
In this step, we generate the variables for the Poisson regression model from the potential rules. To convert a rule into an interaction, we separate the rules into two cases. In the first case, the variables on the left-hand side are originally categorical variables. We generate interactions among the main effects on the left-hand side with the same settings that appear in the rule. Assume that the selected rule has three predictors with the form "If X_i = x_i, X_j = x_j, and X_k = x_k, then Y = y," where x_i is the level of variable X_i, x_j is the level of variable X_j, x_k is the level of variable X_k, and y is the level of response Y. We generate an interaction among X_i, X_j, and X_k by labeling this interaction as 1 if X_i = x_i, X_j = x_j, and X_k = x_k, and as 0 otherwise. This interaction is denoted by X_i(x_i)X_j(x_j)X_k(x_k). For example, for the rule "If X_1 = 0, X_2 = 1, and X_3 = 1, then Y = 0," we create an interaction among X_1, X_2, and X_3 denoted by X_1(0)X_2(1)X_3(1). We have X_1(0)X_2(1)X_3(1) = 1 if X_1 = 0, X_2 = 1, and X_3 = 1, and 0 otherwise. Note that the level of Y does not play any role in generating the variables.
The second case is when the variables on the left-hand side include any variable that has been converted from a quantitative variable. For example, suppose the selected rule is "If X_i = x_i, X_j = x_j, X_k = x_k, and X_q = x_q, then Y = y," where X_q is the variable converted from the quantitative variable and x_q is the level of variable X_q. We generate an interaction among X_i, X_j, X_k, and X_q by multiplying the original variable of X_q with the generated interaction X_i(x_i)X_j(x_j)X_k(x_k), as in the previous example. If the original variable of X_q is Q, the generated interaction is referred to as X_i(x_i)X_j(x_j)X_k(x_k)Q. The results from this step are inputs for the fifth step.
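Both cases of variable generation can be sketched in a few lines. The function names `interaction_indicator` and `interaction_with_quantitative` are hypothetical, and the records are made up; the first example mirrors the rule "If X_1 = 0, X_2 = 1, and X_3 = 1, then Y = 0" from the text.

```python
def interaction_indicator(record, condition):
    """Case 1: 1 if the record matches every (variable, level) pair, else 0."""
    return 1 if all(record[v] == lv for v, lv in condition.items()) else 0

def interaction_with_quantitative(record, condition, original_value):
    """Case 2: the indicator of the categorical part times the original Q."""
    return interaction_indicator(record, condition) * original_value

# Case 1: the interaction X1(0)X2(1)X3(1).
condition = {"X1": 0, "X2": 1, "X3": 1}
match = interaction_indicator({"X1": 0, "X2": 1, "X3": 1}, condition)
no_match = interaction_indicator({"X1": 1, "X2": 1, "X3": 1}, condition)

# Case 2: the same categorical part combined with a quantitative variable Q.
# The discretized level of Q from the rule is not used; the original value is.
with_q = interaction_with_quantitative({"X1": 0, "X2": 1, "X3": 1}, condition, 12.5)
without_q = interaction_with_quantitative({"X1": 0, "X2": 0, "X3": 1}, condition, 12.5)
```

As stated in the text, the level of Y in the rule plays no role in either case.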

Step 5: Model Selection
In principle, any model selection criterion can be used. Here, the Bayesian information criterion (BIC) is used for illustration (Schwarz, 1978). The best subset selection method is performed by testing all the possible combinations of variables and selecting the one that gives the optimal BIC. In other words, the model selected is the one that gives the lowest BIC among all the models. We use the original response (before discretization) for the model fit. The variables in this step consist of all the interactions generated in Step 4 plus all the main effects. Note that there are many choices in this model selection step. Other advanced model selection methodologies can be applied here, e.g., the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD) (Fan, 1997), the adaptive lasso (Zou, 2006), the least-angle regression algorithm (LARS) (Efron, Hastie, Johnstone & Tibshirani, 2004), and the Dantzig selector (Candes & Tao, 2007). Other selection criteria can also be used, e.g., the Akaike information criterion (AIC) (Akaike, 1974), the corrected Akaike information criterion (AICc) (Hurvich & Tsai, 1993), the corrected Bayesian information criterion (BICc) (Tremblay & Wallach, 2004), and the significance of each individual variable. Note that each association rule can be converted into only one interaction. The final selected model will be the optimal model with a combination of main effects and potential interactions.
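Best subset selection by BIC can be sketched in pure Python on a small made-up dataset. The sketch below fits each candidate Poisson model by plain Newton-Raphson (no step-halving or safeguards against collinear columns, which a real analysis would need) and uses BIC = -2 log L + p log n, where p counts the intercept. All function names and data are assumptions for illustration and are not the software used in this paper.

```python
import itertools
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def poisson_loglik(rows, y, iters=50):
    """Maximized log-likelihood of a log-link Poisson model (Newton-Raphson).

    rows holds one list per observation, with the intercept column first.
    """
    p = len(rows[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # intercept-only start
    for _ in range(iters):
        mu = [math.exp(sum(b * v for b, v in zip(beta, r))) for r in rows]
        grad = [sum((yi - mi) * r[j] for yi, mi, r in zip(y, mu, rows))
                for j in range(p)]
        hess = [[sum(mi * r[j] * r[k] for mi, r in zip(mu, rows))
                 for k in range(p)] for j in range(p)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    mu = [math.exp(sum(b * v for b, v in zip(beta, r))) for r in rows]
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

def best_subset_bic(columns, y):
    """Try every subset of candidate columns; return (lowest BIC, subset)."""
    n = len(y)
    best = None
    for size in range(len(columns) + 1):
        for subset in itertools.combinations(list(columns), size):
            rows = [[1.0] + [columns[name][i] for name in subset]
                    for i in range(n)]
            bic = -2.0 * poisson_loglik(rows, y) + (len(subset) + 1) * math.log(n)
            if best is None or bic < best[0]:
                best = (bic, subset)
    return best

# Made-up data: counts are elevated only when both a = 1 and b = 1.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
candidates = {"a": a, "b": b, "a(1)b(1)": [u * v for u, v in zip(a, b)]}
counts = [2, 3, 3, 2, 2, 3, 20, 22]
best_bic, best_vars = best_subset_bic(candidates, counts)
```

On this toy data the interaction variable alone reproduces the fit of the larger models, so the search selects it in preference to the main effects. Exhaustive search is feasible here only because there are three candidates; with the main effects plus 30 to 50 interactions, a bounded search over model size or one of the alternative methods named above would be needed.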

Application: Patent Applications Dataset
In this section, we use a patent applications dataset (Cincera, 1997) to demonstrate how to apply our proposed method. We select the model for this dataset using our framework. Note that we use a portion of the dataset used in Cincera (1997), which contains records of 181 firms with the variables shown in Table 1. The objective is to construct a model to explain and predict the number of patent applications in 1991 in reference to the available variables, including industry sector, firm location, availability of patent applications in 1990, and R&D expenditure. The dummy variables for R&D expenditure are defined as follows:
X_R1 = 1 if R&D expenditure is in the first 25%, and X_R1 = 0 otherwise
X_R2 = 1 if R&D expenditure is in the second 25%, and X_R2 = 0 otherwise
X_R3 = 1 if R&D expenditure is in the third 25%, and X_R3 = 0 otherwise
X_R4 = 1 if R&D expenditure is in the last 25%, and X_R4 = 0 otherwise
We applied the proposed method to this dataset and obtained the following results:
Step 1: Discretize the response into four levels. We followed the four levels for patent applications (Schwalbach & Zimmermann, 1991) as summarized in Winkelmann (2008). We converted R&D expenditure, the quantitative variable, into a four-level categorical variable and thereby generated four dummy variables, as shown in Table 1.
Step 2: Use CBA to obtain the active rules. We used a minimum confidence value of 80% to generate the active rules. Note that this dataset has a large number of variables relative to the number of instances.
Step 4: Convert the 50 selected rules into variables. For example, rule 1 was converted into the new variable referred to as X_4(1)X_20(1)X_R. Note that X_23 was generated from X_R, which is the quantitative variable. Therefore, the conversion followed the second case of the variable generation (Step 4 in Section 3). To generate the new variable, we multiplied the original quantitative variable X_R with the dummy variable: X_4(1)X_20(1)X_R = X_R if X_4 = 1 and X_20 = 1, and 0 otherwise.
Step 5: Apply the subset selection method to find the proposed model. Note that the candidate variables comprised the main effects (X_1, X_2, …, X_24) and all the potential interactions generated in Step 4.
We selected the ten-variable model, which is the model with the best BIC, to represent this dataset. Some major findings from this model can be explained as follows:
i. Among all sectors, the estimated mean for patent applications is highest in the electricity sector and lowest in the vehicle sector.
ii. Among all areas, the estimated mean for patent applications is highest for Japan and lowest for the USA.
iii. The estimated mean for patent applications in 1991 is higher when there was at least one patent application in 1990 than when there were no patent applications in 1990.
iv. The estimated mean for patent applications increases with an increase in R&D expenditure.
v. If the sector is drugs and the number of patent applications in 1990 is more than 0, an increase in R&D expenditure will decrease the estimated mean for patent applications.
vi. If the sector is electricity and the number of patent applications in 1990 is more than 0, an increase in R&D expenditure will decrease the estimated mean for patent applications.
We stopped the search process at ten variables because the ten-variable model outperforms all the classical models, which are based on the main effects only. Moreover, the proposed model also outperforms all the classical models based on the AIC criterion and deviance. Note that the AIC and BIC for this model are 8,047 and 8,112, respectively, whereas the AIC and BIC for the optimal classical model are 8,111 and 8,202, respectively. The deviance for this model is 8,025, whereas the optimal deviance among all the classical models is 8,075.
The model selected through the process described has two interactions, which, when added to the Poisson regression model, lower its AIC, BIC, and deviance relative to the classical models. Therefore, compared to the classical model, the proposed method better explains the patents dataset and improves the prediction of patent applications. Note that the two interactions cannot be found by any available variable selection method for the Poisson regression model.
The selected ten-variable model is as shown earlier. The three-way interactions are found by the proposed method. However, for those who believe in the hierarchical principle, the main effects and the two-way interactions of the variables that are involved in the three-way interactions can be added to the model as well.

Discussion and Conclusion
A central problem in Poisson regression modeling is accounting for interaction effects. In this study, we combine association rules analysis with Poisson regression modeling with the central purpose of capturing possible low-order and high-order interactions. Our study confirms that the methodology proposed herein can be used effectively to select interactions that improve model fit and to further our understanding of any given dataset. As shown via the real dataset, the results affirm the effectiveness of the proposed method used with Poisson regression modeling: specifically, the proposed method can provide a better explanation and a better fit for Poisson regression modeling than the classical method does. Note that it is not possible to select the two interactions in the proposed model using any conventional approach.
Among the data-mining techniques, the decision tree is capable of finding interactions among variables. However, the decision tree has the disadvantage of a hierarchical structure. Association rules analysis has an important advantage over the decision tree: association rules analysis enables a global search such that more potential interactions can be considered than is possible with a decision tree structure.
This paper shows that, by adding discretization to the process, association rules analysis can help find important interactions not only among categorical variables but also among quantitative variables, even when the response takes a very large number of distinct values. This also leads us to believe that the proposed approach is effective for other linear regression models as well.

Table 1. Attributes for the Patent Applications Dataset