A Comparison of the Optimal Classification Rule and Maximum Likelihood Rule for Binary Variables

The optimal classification rule attains the largest possible posterior probability of correct allocation with respect to the prior. Both it and the maximum likelihood rule have this 'nice' optimality property and are appropriate for the development of linear classification models. In this paper we consider the problem of choosing between the two methods and set out guidelines for a proper choice. The comparison between the methods is based on several measures of predictive accuracy, and the performance of the methods is studied by simulation.


Introduction
Optimal classification rules and the maximum likelihood rule are widely used multivariate statistical methods for the analysis of data with categorical outcome variables. Both are appropriate for the development of linear classification models, i.e. models associated with linear boundaries between the groups. Binary classification is the task of classifying the elements of a given set into two groups on the basis of a classification rule.
The results obtained are presented and discussed in Section 5, and conclusions and recommendations are given in Section 6.

The Optimal Classification Rule
Independent Random Variables: Let \(\pi_1\) and \(\pi_2\) be any two multivariate Bernoulli populations. Let \(c(i|j)\) be the cost of misclassifying an item with measurement \(x\) from \(\pi_j\) into \(\pi_i\), and let \(q_i\) be the prior probability of \(\pi_i\), \(i = 1, 2\), with \(q_1 + q_2 = 1\). The probability mass function of \(\pi_i\) is
\[ f_i(x) = \prod_{j=1}^{r} p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j}, \qquad x_j \in \{0, 1\}. \]
Suppose that we assign an item with measurement vector \(x\) to \(\pi_1\). The expected cost of misclassification is
\[ \mathrm{ECM} = c(2|1)\, q_1 P(2|1) + c(1|2)\, q_2 P(1|2), \]
where \(P(i|j)\) denotes the probability of classifying into \(\pi_i\) an item actually from \(\pi_j\). The optimal rule is the one that partitions \(R^r\) into regions \(R_1\) and \(R_2\) so that the ECM is minimized. Therefore the optimal classification rule with respect to minimization of the expected cost of misclassification (ECM) is: classify an item with measurement \(x\) into \(\pi_1\) if
\[ \frac{f_1(x)}{f_2(x)} \geq \frac{c(1|2)\, q_2}{c(2|1)\, q_1}; \]
otherwise classify the item into \(\pi_2\). Without loss of generality, we assume that \(q_1 = q_2 = 1/2\) and \(c(1|2) = c(2|1)\). The minimization of the ECM then becomes the minimization of the probability of misclassification \(P(mc)\), and under these assumptions the optimal rule reduces to: classify an item with measurement \(x\) into \(\pi_1\) if \(f_1(x) \geq f_2(x)\); otherwise classify the item into \(\pi_2\). Since \(x\) is multivariate Bernoulli with \(p_{ij} > 0\), \(i = 1, 2\), \(j = 1, 2, \ldots, r\), taking logarithms gives the optimal rule: classify an item with response pattern \(x\) into \(\pi_1\) if
\[ \sum_{j=1}^{r} x_j \ln\!\left[\frac{p_{1j}(1 - p_{2j})}{p_{2j}(1 - p_{1j})}\right] \geq \sum_{j=1}^{r} \ln\!\left[\frac{1 - p_{2j}}{1 - p_{1j}}\right]. \qquad (2.9) \]
Otherwise, classify the item into \(\pi_2\).
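The log-likelihood-ratio form of the rule can be sketched in a few lines of Python. The parameter vectors p1 and p2 below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

# Hypothetical component probabilities P(x_j = 1 | pi_i) for r = 3 variables.
p1 = np.array([0.8, 0.7, 0.6])
p2 = np.array([0.3, 0.2, 0.4])

def optimal_rule(x, p1, p2):
    """Classify response pattern x into population 1 or 2 using the
    optimal rule with equal priors and equal misclassification costs."""
    # discriminant weights: ln[p_1j(1 - p_2j) / (p_2j(1 - p_1j))]
    weights = np.log(p1 * (1 - p2) / (p2 * (1 - p1)))
    # threshold: sum_j ln[(1 - p_2j) / (1 - p_1j)]
    threshold = np.sum(np.log((1 - p2) / (1 - p1)))
    return 1 if np.dot(x, weights) >= threshold else 2

print(optimal_rule(np.array([1, 1, 1]), p1, p2))  # -> 1
print(optimal_rule(np.array([0, 0, 0]), p1, p2))  # -> 2
```

Because the rule is linear in \(x\), it defines a linear boundary between the two groups, as noted in the Introduction.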
For any rule, the average or expected cost of misclassification (ECM) is given by the sum of the products of the off-diagonal cost entries and their probabilities of occurrence. A good classification rule should have an ECM as small as possible. The regions \(R_1\) and \(R_2\) that minimize the ECM are defined by the values \(x\) for which the above inequalities hold.
If the parameters are unknown, they are estimated by their maximum likelihood estimators
\[ \hat{p}_{ij} = \frac{n_{ij}}{n_i}, \]
where \(n_{ij}\) is the number of observations from \(\pi_i\) with the \(j\)th variable equal to 1 and \(n_i\) is the sample size from \(\pi_i\). The rule for unknown parameters is: classify an item with response pattern \(x\) into \(\pi_1\) if
\[ z = \sum_{j=1}^{r} x_j \ln\!\left[\frac{\hat{p}_{1j}(1 - \hat{p}_{2j})}{\hat{p}_{2j}(1 - \hat{p}_{1j})}\right] \geq \sum_{j=1}^{r} \ln\!\left[\frac{1 - \hat{p}_{2j}}{1 - \hat{p}_{1j}}\right]; \]
otherwise, classify the item into \(\pi_2\). To find the distribution of \(z\) we note that \(z\) is a sum of \(r\) independent weighted Bernoulli variables.
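The estimation step can be sketched as follows. The true parameters and sample sizes are hypothetical, and the estimates are clipped away from 0 and 1 (a practical adjustment, not part of the rule as stated) so the logarithms in the plug-in rule stay finite:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and sample sizes, for illustration only.
p1_true = np.array([0.8, 0.7, 0.6])
p2_true = np.array([0.3, 0.2, 0.4])
n1 = n2 = 50

# Training samples: one row per item, one column per binary variable.
sample1 = (rng.random((n1, 3)) < p1_true).astype(int)
sample2 = (rng.random((n2, 3)) < p2_true).astype(int)

# MLE: p_hat_ij = (# observations from pi_i with x_j = 1) / n_i,
# clipped so that ln(p_hat) and ln(1 - p_hat) are defined.
p1_hat = np.clip(sample1.mean(axis=0), 1e-6, 1 - 1e-6)
p2_hat = np.clip(sample2.mean(axis=0), 1e-6, 1 - 1e-6)

def plug_in_rule(x, p1_hat, p2_hat):
    """Plug-in version of the optimal rule: MLEs replace the parameters."""
    weights = np.log(p1_hat * (1 - p2_hat) / (p2_hat * (1 - p1_hat)))
    threshold = np.sum(np.log((1 - p2_hat) / (1 - p1_hat)))
    return 1 if np.dot(x, weights) >= threshold else 2
```

With moderate sample sizes the estimated weights are close to the true ones, so the plug-in rule behaves much like the optimal rule.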

Optimal Rule for a Case of Three Variables in Two-Group Classification
Suppose we have three independent variables. Following Onyeagu (2003), the rule is: classify an item with response pattern \(x\) into \(\pi_1\) if
\[ \sum_{j=1}^{3} x_j \ln\!\left[\frac{p_{1j}(1 - p_{2j})}{p_{2j}(1 - p_{1j})}\right] \geq \sum_{j=1}^{3} \ln\!\left[\frac{1 - p_{2j}}{1 - p_{1j}}\right]; \qquad (2.2.1) \]
otherwise, classify the item into \(\pi_2\). Written in another form, the rule simplifies to: classify an item with response pattern \(x\) into \(\pi_1\) if
\[ w = \sum_{j=1}^{3} x_j \ln\!\left[\frac{p_{1j}(1 - p_{2j})}{p_{2j}(1 - p_{1j})}\right] - \sum_{j=1}^{3} \ln\!\left[\frac{1 - p_{2j}}{1 - p_{1j}}\right] \geq 0. \qquad (2.2.3) \]

Optimal Rules for a Case of Four Variables in Two-Group Classification
Suppose we have four independent Bernoulli variables. The rule is: classify an item with response pattern \(x\) into \(\pi_1\) if
\[ \sum_{j=1}^{4} x_j \ln\!\left[\frac{p_{1j}(1 - p_{2j})}{p_{2j}(1 - p_{1j})}\right] \geq \sum_{j=1}^{4} \ln\!\left[\frac{1 - p_{2j}}{1 - p_{1j}}\right]; \]
otherwise, classify the item into \(\pi_2\). Written in another form, the rule simplifies to: classify an item with response pattern \(x\) into \(\pi_1\) if \(w \geq 0\), where \(w\) is defined as in the three-variable case; otherwise, classify the item into \(\pi_2\). For the case of four variables, let \(z\) denote the discriminant statistic on the left-hand side. The distribution function is derived in just the same way as in the case of three variables. Using the same method, the probability mass function of \(z\) and the distribution function for the case of five variables can be derived.
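The probability mass function of the discriminant statistic \(z\) can be obtained mechanically by enumerating all \(2^r\) response patterns and accumulating their probabilities under \(\pi_1\); a sketch with hypothetical parameter values for \(r = 4\):

```python
import numpy as np
from itertools import product
from collections import defaultdict

# Hypothetical parameters for r = 4 independent Bernoulli variables.
p1 = np.array([0.8, 0.7, 0.6, 0.75])
p2 = np.array([0.3, 0.2, 0.4, 0.25])

weights = np.log(p1 * (1 - p2) / (p2 * (1 - p1)))

# P(z = value | pi_1): enumerate all 2^4 patterns, weight each by f_1(x).
pmf_z = defaultdict(float)
for pattern in product([0, 1], repeat=4):
    x = np.array(pattern)
    f1 = float(np.prod(np.where(x == 1, p1, 1 - p1)))
    z = round(float(np.dot(x, weights)), 10)  # round to merge equal values
    pmf_z[z] += f1

# The probabilities over all attainable values of z sum to one.
print(sum(pmf_z.values()))
```

The distribution function of \(z\) follows by sorting the support and taking cumulative sums; the same enumeration extends directly to five variables.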

Probability of Misclassification
In constructing a procedure of classification, it is desired to minimize, on the average, the bad effects of misclassification (Onyeagu, 2003; Richard and Dean, 1988; Oludare, 2011). Suppose we have an item with response pattern \(x\) from either \(\pi_1\) or \(\pi_2\). We think of an item as a point in an \(r\)-dimensional space. We partition the space \(R\) into two mutually exclusive regions \(R_1\) and \(R_2\). If the item falls in \(R_1\), we classify it as coming from \(\pi_1\), and if it falls in \(R_2\) we classify it as coming from \(\pi_2\). In following a given classification procedure, the researcher can make two kinds of errors in classification: an item actually from \(\pi_1\) can be classified as coming from \(\pi_2\), and an item from \(\pi_2\) can be classified as coming from \(\pi_1\). We need to know the relative undesirability of these two kinds of errors. Let the prior probability that an observation comes from \(\pi_1\) be \(q_1\), and from \(\pi_2\) be \(q_2\), and let the probability mass function of \(\pi_i\) be \(f_i(x)\). Let the region of classifying into \(\pi_1\) be \(R_1\) and into \(\pi_2\) be \(R_2\). Then the probability of correctly classifying an observation that is actually from \(\pi_1\) is
\[ P(1|1) = \sum_{x \in R_1} f_1(x), \]
and similarly the probability of correctly classifying an observation from \(\pi_2\) is
\[ P(2|2) = \sum_{x \in R_2} f_2(x). \]
The total probability of misclassification using the rule is
\[ P(mc) = q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x). \]
In order to determine the performance of a classification rule \(R\) in the classification of future items, we compute the total probability of misclassification, known as the error rate. Lachenbruch (1975) defined the following types of error rates.
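For multivariate Bernoulli data the total probability of misclassification can be computed exactly by enumerating all \(2^r\) response patterns; a sketch under hypothetical parameter values:

```python
import numpy as np
from itertools import product

def pmf(x, p):
    # independent multivariate Bernoulli probability mass function
    return float(np.prod(np.where(x == 1, p, 1 - p)))

def total_prob_misclassification(p1, p2, q1=0.5):
    """q1 * P(2|1) + q2 * P(1|2) for the optimal rule, by enumeration."""
    r = len(p1)
    weights = np.log(p1 * (1 - p2) / (p2 * (1 - p1)))
    threshold = np.sum(np.log((1 - p2) / (1 - p1)))
    tpm = 0.0
    for pattern in product([0, 1], repeat=r):
        x = np.array(pattern)
        if np.dot(x, weights) >= threshold:   # x falls in R1
            tpm += (1 - q1) * pmf(x, p2)      # pi_2 item classified into pi_1
        else:                                 # x falls in R2
            tpm += q1 * pmf(x, p1)            # pi_1 item classified into pi_2
    return tpm

# Hypothetical, well-separated populations give a small error rate.
tpm = total_prob_misclassification(np.array([0.8, 0.7, 0.6]),
                                   np.array([0.3, 0.2, 0.4]))
print(tpm)
```

Since the optimal rule minimizes \(P(mc)\), the value returned here is the optimum error rate of type (i) below when the true parameters are used.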

(i) Error rate for the optimum classification rule, \(R_{opt}\): the error rate when the parameters of the distributions are known, which is optimum for these distributions.

(ii) Actual error rate: the error rate for the classification rule as it will perform on future samples.
(iii) Expected actual error rate: the expected error rate for classification rules based on samples of size \(n_1\) from \(\pi_1\) and \(n_2\) from \(\pi_2\). (iv) Plug-in estimate of the error rate: obtained by using the estimated parameters for \(\pi_1\) and \(\pi_2\).

(v) The apparent error rate: the fraction of items in the initial sample which is misclassified by the classification rule. (2.4.4)
Hills (1967) called the second error rate the actual error rate and the third the expected actual error rate. Hills showed that the actual error rate is greater than the optimum error rate, which in turn is greater than the expectation of the plug-in estimate of the error rate. Fukunaga and Kessel (1972) proved a similar inequality. An algebraic expression for the exact bias of the apparent error rate of the sample multinomial discriminant rule was obtained by Goldstein and Wolf (1977), who tabulated it under various combinations of the sample sizes \(n_1\) and \(n_2\), the number of multinomial cells and the cell probabilities. Their results demonstrated that the bound described above is generally loose.

Evaluating the Probability of Misclassification for the Optimal Rule R_opt
The optimal classification rule \(R_{opt}\) for \(x = (x_1, \ldots, x_r)\), which is distributed multivariate Bernoulli, is: classify an item with response pattern \(x\) into \(\pi_1\) if
\[ \sum_{j=1}^{r} x_j \ln\!\left[\frac{p_{1j}(1 - p_{2j})}{p_{2j}(1 - p_{1j})}\right] \geq \sum_{j=1}^{r} \ln\!\left[\frac{1 - p_{2j}}{1 - p_{1j}}\right]. \]
In the special case where \(p_{1j} = p_1\) and \(p_{2j} = p_2\) for all \(j\), the rule depends on \(x\) only through \(\sum_j x_j\), which is binomial, and the probability of misclassification can be evaluated directly. If the parameters are unknown, we plug the maximum likelihood estimates into the rule for the general case to obtain the classification rule: classify an item with response pattern \(x\) into \(\pi_1\) if the plug-in inequality holds. Likewise, we plug the maximum likelihood estimates \(\hat{p}_1\) and \(\hat{p}_2\) into the equation for the special case to obtain the rule: classify the item with response pattern \(x\) into \(\pi_1\) if
\[ \sum_{j=1}^{r} x_j \ln\!\left[\frac{\hat{p}_1 (1 - \hat{p}_2)}{\hat{p}_2 (1 - \hat{p}_1)}\right] \geq r \ln\!\left[\frac{1 - \hat{p}_2}{1 - \hat{p}_1}\right]. \qquad (2.5.18) \]

Otherwise, classify the item into \(\pi_2\). The probability of misclassification is then given by (2.5.20)–(2.5.22).

Maximum Likelihood Rule (ML-Rule)
The maximum likelihood discriminant rule for allocating an observation \(x\) to one of the populations \(\pi_1, \ldots, \pi_g\) is to allocate \(x\) to the population which gives the largest likelihood to \(x\). That is, the maximum likelihood rule allocates \(x\) to \(\pi_i\) when
\[ L_i(x) = \max_{1 \le j \le g} L_j(x). \]
When \(g = 2\), the rule allocates \(x\) to \(\pi_1\) if \(L_1(x) \geq L_2(x)\), and to \(\pi_2\) otherwise.
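A minimal sketch of the \(g\)-population maximum likelihood rule (the parameter vectors are hypothetical; ties are broken in favour of the lowest-numbered population by argmax):

```python
import numpy as np

def ml_rule(x, params):
    """Allocate x to the population pi_i whose likelihood L_i(x) is largest.
    params: one vector of Bernoulli probabilities per population."""
    likelihoods = [float(np.prod(np.where(x == 1, p, 1 - p))) for p in params]
    return int(np.argmax(likelihoods)) + 1  # populations numbered from 1

# Hypothetical two-group example (g = 2): the all-ones pattern is far more
# likely under the first population, the all-zeros pattern under the second.
params = [np.array([0.9, 0.8, 0.9]), np.array([0.1, 0.2, 0.1])]
print(ml_rule(np.array([1, 1, 1]), params))  # -> 1
print(ml_rule(np.array([0, 0, 0]), params))  # -> 2
```

With \(g = 2\) and equal priors this coincides with the optimal rule above, which is why the simulation comparison turns on estimated rather than known parameters.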
A simulation experiment which generates the data and evaluates the procedures is now described. (i) Random samples of sizes \(n_1\) and \(n_2\) are generated from \(\pi_1\) and \(\pi_2\). These samples are used to construct the rule for each procedure, and an estimate of the probability of misclassification for each procedure is obtained by the plug-in rule, i.e. the confusion matrix in the sense of the full multinomial. (ii) The likelihood ratios are used to define the classification rules, and the plug-in estimates of the error rates are determined for each of the classification rules.

(iii) Steps (i) and (ii) are repeated 1000 times, and the mean plug-in errors and variances over the 1000 trials are recorded. The method of estimation used here is called the resubstitution method.
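The simulation loop above can be sketched as follows. Population parameters and sample sizes are hypothetical, and only 200 replications are used here (the experiment in the text uses 1000) to keep the sketch fast:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical settings for illustration.
p1, p2 = np.array([0.8, 0.7, 0.6]), np.array([0.3, 0.2, 0.4])
n1 = n2 = 50
reps = 200

apparent_errors = []
for _ in range(reps):
    # (i) generate training samples from each population
    s1 = (rng.random((n1, 3)) < p1).astype(int)
    s2 = (rng.random((n2, 3)) < p2).astype(int)
    # (ii) build the plug-in rule from the MLEs (clipped to avoid log 0)
    p1_hat = np.clip(s1.mean(axis=0), 1e-6, 1 - 1e-6)
    p2_hat = np.clip(s2.mean(axis=0), 1e-6, 1 - 1e-6)
    w = np.log(p1_hat * (1 - p2_hat) / (p2_hat * (1 - p1_hat)))
    t = np.sum(np.log((1 - p2_hat) / (1 - p1_hat)))
    # resubstitution: reclassify the training data itself
    err1 = np.mean(s1 @ w < t)    # pi_1 items sent to pi_2
    err2 = np.mean(s2 @ w >= t)   # pi_2 items sent to pi_1
    apparent_errors.append(0.5 * err1 + 0.5 * err2)

# (iii) mean apparent error rate and its standard deviation over the trials
print(np.mean(apparent_errors), np.std(apparent_errors))
```

Because resubstitution reclassifies the same data used to fit the rule, the apparent error rate is an optimistically biased estimate of the actual error rate, as the Hills (1967) inequalities in Section 2.4 indicate.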
The following tables display some of the results obtained. Tables 4.2(a) and (b) present the mean apparent error rates and standard deviations for the classification rules under different parameter values. The apparent error rates increase as the sample sizes increase.

Classification Rule Performance
Tables 4.3(a) and (b) show the mean apparent error rates and standard deviations (actual error rates) for the two classification rules, maximum likelihood (ML) and optimal (OP), under different parameter values. It is clear that the mean apparent error rate increases with increasing sample size, while the standard deviation decreases. As the number of variables increases, the performance of the maximum likelihood rule deteriorates. From the analysis, the optimal rule is ranked first, followed by maximum likelihood.

Conclusion
The maximum likelihood procedure performed well for small and moderate numbers of variables irrespective of the sample size, while the optimal classification rule appears to be more consistent for small, moderate and large numbers of variables. Therefore, the optimal rule is a more effective classifier than maximum likelihood.
Table 4.3(a): Apparent error rates for the classification rules under different parameter values and sample sizes. (b): Actual error rates for the classification rules under different parameter values, sample sizes and replications.

Table 4.1(b): Effect of input parameters \(P_1\) and \(P_2\) on the classification rules at various sample sizes. Tables 4.1(a) and (b) present the mean apparent error rates and standard deviations (actual error rates) of the two classification rules. The apparent error rates increase with the sample size; from Table 4.1(b), the error rates decrease with the sample size. With \(n = 1000\), the two classification rules have the same error rate. On the average, maximum likelihood ranks first, followed by optimal.