Estimation of Project Completion Time-Based on a Mixture of Expert in an Interactive Space

Estimation the time and the cost of completing projects on the basis of decision making to use either of the estimation methods are one of the most important issues in project management. In this paper, a decision making database of learning machines, is proposed, that a set of possible estimator are working together to estimate the project completion time, in it. This cooperation is based on samples neighborhood in the feature space. One of the important issues, that learning machines are facing it, is the complexity in feature space, because of features with high-correlation. In this paper, to avoid this problem, principal component analysis (PCA) method is used to accuracy has increase, addition to, increasing in system speed. Moreover, methods based on the ensemble, have a higher reliability and ability to generalization, compared to single methods. Furthermore, the hybrid method, (PCA and ensemble), have all the above mentioned advantages. Therefore, system reliability control, using more powerful learning machines, in ensemble, and also ability of the proposed model, to manage existing poor estimators, in ensemble, are other important features of this method. In the end, a software code was created, which provides ability to connect to MSP.


Introduction
One of the most effective methods in predicting the time and the cost of project completion is use of earned value management.Formula-based methods were used frequently in the past, but nowadays, with increasing volume of received information, from projects, an increasing need to more powerful and more reliable methods, is felt.The methods based on data mining, gradually have widely been used as one of the important methods in this field.Among these methods, can refer to the methods based on neural network, which are used in [1]; in this paper, a neural network with 5 inputs and 5 outputs, and also a hidden layer, trained to predict the actual cost.The process for the minimization of the error is repeated, step by step, until a termination criterion for the system is achieved.Methods based on time series, also have been able to provide, acceptable results in this field, especially, high ability of these methods in predicting future events is the most important factor of these methods [2,3].Methods based on S shape curves are the other methods, which be extensively used in this field [4,5,6].In paper [6], using S shape curves, and the relationship between cash flow and project progress; linear and nonlinear models of project progress, taking into consideration uncertainties in cash flow.In paper [7], a dynamic model obtained from earn value management (EVM), using ARMA from the project progress, and then, in an uncertainty space, time and cost of completion of the work, estimated using Kalman fuzzy method.Support Vector Machine (SVM) is type of learning machines, which has a very high ability in modeling datasets, with a highly complex feature space.Hence, several methods based on SVM have been proposed for estimating the time and cost of projects, which proposed in [8,9,10].In paper [8], a combination of genetic algorithm and SVM has been proposed.Also in [9], a combination of PCA and SVM is used to predict the project time completion.In paper [10], weighted SVM (wSVM), which is an improved model of SVM, is used to reduce the effects of noise and irrelevant data.This model was used with fuzzy logic and genetic algorithms, as fmGA, to calculate Estimate At Completion (EAC).In this article, methods based on fuzzy approach, such as uncertainty in the obtained value calculations, which inputs and outputs of the system, have considered as a fuzzy variable.
In papers [11,12,13], methods based on EVM, reviewed, and new indicators are provided for work in this field.
In paper [14], these three methods are compared with each other, and also the efficiency of cost classic indicators, such as SV and SPI, are compared with the efficiency of new indicators, which are similar to the ( ) and ( ), in units of time.In this paper, we have used 14 different features, which were used for EVM, in paper [14].In Table 1, some features used in these methods are listed.computer vision, and credit scoring [15-19].Also in mentioned papers, only one estimators used for estimating the project.Using only one estimator, in learning process can be problematic.Due to the dynamic nature of the problem, in project management, tuninga learning machine on a specific Dataset, however, can produce good results, but it is very risky for a practical application and has low reliability.Methods, based on ensemble have higher reliability and generalization, because of increasing diversity in decision making process, and also combination of these methods with PCA algorithm, has been able to produce much better results than individual learning machines in ensemble.

Proposed Method
In data mining methods such as classification and regression; first, available data sets should be split into two parts of training set and test set.In the proposed method, how to divide the data set is, in a way that previous carry out projects placed in the training set.It helps to increase knowledge from the problem, and produce a better approximation and also is suitable to the simulation of dynamic project.By developing the current project in each step, produced knowledge is added to the training set, and the rest of current project place in the test set.For the preparation of data collection, first, features are normalized.After normalization of the dataset, one of the crucial problems, that prediction faces it, is the curse of dimensionality phenomena.This problem makes the feature space very complex.Existence characteristics that have a strong correlation together, make the forecast slow, and also can have a negative impact on the accuracy of the proposed method.As it has been mentioned before, EVM features have strong linear relationships with each other, so using all the features can have a negative impact on the forecast.In this paper, we tried to solve the problem using a well-known technique, PCA.
After simplification dataset, and extracting principal components, dataset, is ready to implement the proposed method.In this paper, an ensemble method is used; in which four different estimators interact with each other.This method can be called as a method of ensemble of expert that final result is produced based on a fusion method.In this paper, the similarity between neighboring samples is used in the feature space to different estimators have a collaboration together.As we know, the principle of neighborhood is one of the important issues in the data mining, and in other areas, such as Image processing.Different methods such as K-nearest neighbor (KNN), attracted a lot attention for different applications.
Hence, the process is as follows that initially, all estimators be trained on the training set at every stage of the project.As a result, the estimator learns the learning sets and be prepared to estimate results for data from the test set.This work is performed in this way that, initially, the KNN algorithm run on each data sample in test set.
Based on terms of user K-nearest neighbors are found for each sample of the test, of the training set, then, the neighborhoods ordered as ascending.The nearest neighbors have most closely resembles and farthest neighbors have least similarity to the test sample.Accuracy of learning machines on a single training example, in the previous step is known, according to this, we can identify the best estimator for different K neighbors.Suppose, if among the three existing learning machines, in ensemble, 2 learning machines have the best result on the 5 neighborhoods, so both two learning machines will be used to estimate the result of the test sample.In conclusion, these results should be fused, together.To this work, a weighted function based on distance between samples and test sample in the feature space was proposed.Equation 1shows how calculation of estimated amount for each sample.Where, ( ) is the predicted amount from proposed ensemble method, on sample t, and, ( ) is estimator , from estimators that have best results on the K neighbors, and is the weight.( ) is defined in equation 1.
) It should be noted that weight has an inverse relationship with distance, hence, should be reduced by increasing the sample distance, also, for the weight should we have: ∑ = 1 .In this paper, first, weights are sorted in descending order, then, the relationship between weights calculated as follow, where is an array of distances.A proposed weight is defined in equation 2.

Principal Component Analysis
Finding patterns in data with high dimensions is very difficult; therefore, data dimension reduction methods are used to reduce the dimension of data.PCA method is one of the most powerful methods for this purpose [20].With this method, the data dimensions is reduced and also do not ignored a lot of information.Consequently, the estimator can obtain higher accuracy too, by reducing the feature space complexity. (3) Where, is the covariance; and , are the mean of the variablesX and Y. Also, N is the number of samples.Equation4 is used to calculate the covariance matrix, in which, showsthe covariance matrix, and it is a symmetric matrix.
(4) , = ( ( , ), ( , ) = ( , )) Since, the covariance matrix is a square matrix, eigenvector and eigenvalues are extracted from, covariance matrix.On the other hand, eigenvectors are perpendicular on each other, and length of each is equal to 1.
(5) = , ( − ) = 0 (6) det( − ) = 0 In equations 5 and 6, A, is the covariance matrix, and λ, are amount of eigenvalues, and shows eigenvectors, and shows the identity matrix.Also , shows the determinant of a matrix.After calculating the Eigenvalues and eigenvectors, amounts of eigenvalue, indicates the degree of importance, and available information in the eigenvector.Consequently, by ignoring components, with smaller amounts, we can reduce the dimension of the data, appropriately.

KNN Algorithm
KNN algorithm, is a kind of methods of Lazy learning and instance-based learning, and is able to find a variety of applications in other areas, such as data processing and image processing [21].This algorithm relies on the principle of similarity between neighboring samples.In this section, K is a systematic parameter which should be tuned, before.In the KNN method, if the existing sample in database is identified with an equation, and d, be size of samples in feature space, then we get: (7) ( , ), ( , ) … , ( , ) ∈ , ∈ Then, by defining a dissimilarity measure, we can find difference, or distance between two samples in the feature space.Different methods have been proposed for dissimilarity measure, that, in short, it can be written as equation 8. To show the nearest samples in the feature space, we assume the feature space, two-dimensional, and also make the estimation problem into a two-class problem, for simplicity's sake.Nearest neighbors, for samples that were identified, with black color, and placed in the test set, for k = 3, is in Figure 2.
Figure 2. Showing NN-3 in two-dimensional space

Experimental Results
In this paper, as stated above, the 3 projects in the article [14], have been used.Available data, in this article have risen with factor 10, in the form of interpolation techniques, and also by the random factor.Three testsets are designed based on three presented data sets, which titled as EX1 and EX2 and EX3.These experiments were designed, fully dynamic, during the progress of the projects.For example, in EX3, it is assumed that the company has done projects I and II, and has started to do the third project.Hence, a data set from first two projects has been formed.Estimation done in beginning of each month, with the progress of the project, and third project information also adds to dataset, dynamically, during its progress.

Test Number 1
First, Figure 3 shows a cumulative distribution of available information in the connected components for EX3, to determine the number of used components in this project.According to the figure, by using the first three components, we can use almost all available information, in the features, for estimating.As a result, by using 3features rather than 14 features in estimating, the complexity of feature space becomes lower, and thus, the accuracy increases, obviously.Table 2 shows amount of mean of MSE for different methods.Also dynamic process of development is shown in the figure, where we effort to show, different estimates for different months, in it.In Table 2, the mean values in the different phases of the project are located.According to the table, we can see that the accuracy of single model methods is changed extremely, with change in their data sets, so that, neural network is the best estimator in EX1,while in EX3, it is the worst estimator.However, the proposed method, by changing in dataset, coordinates itself, with the best existing local estimator in ensemble.According to the table, there is relatively little reduction in accuracy of the proposed method, by increasing the number of neighborhoods.According to the figure, the project includes 32 steps, which are divided as follow; project A divided into 12 stages (months), B divided into 11 stages, and C divided into 9 stages for estimation.This division is presented based on the number of divided months, for 3 projects in [14].So that, the first 12 steps, are relating to EX1, and the next 11 steps are related to EX2, and also the last 9 steps are related to EX3.

Test Number 2
In test number two we attempt to change data, this work, causes to the accuracy of each method in ensemble have change.The results of the second part of experiment are shown in Table 3.This change in this case, is that the method used in [14], in forming dataset, has changed.Due to the table, we can see that the accuracy of single model methods could changes dramatically by change their dataset, so that neural network is the best estimator in EX1, while, in EX3, it is the worst estimator.However, proposed method with changing the dataset tune itself with the best existing local estimator in ensemble.Due to the table, in the proposed method, increasing in the number of neighborhoods leads to a little increasing in proposed method accuracy.Also the process of estimation for each month is specified, in the figure 6.

Conclusions
In this paper, a method based on mixture of experts is presented, in which the final result is calculated by the proposed fusion method.KNN method is used to find the best estimators in the neighborhood.Finally, a fusion method is proposed for combination of different estimation.Diversity increases with increasing in deferent learning, and this increase in diversity can increase generalization in final decision.It causes, if the number of estimators, in the test set has increase, or become more complex, not only the accuracy will increase on the training samples, but also this increase in accuracy is obvious in samples testing.As a result, with using more complex learning machines, in ensemble, it can be expected, that the results become far better.The results show that the proposed method has achieved greater accuracy in both tests, compared to each of existing samples in ensemble. - the proposed method are shown in Fig 1.

Figure 1 .
Figure 1.Flowchart of the proposed method

Figure 3 .
Figure 3. Cumulative distribution graphs, for percentage of useful information for different features

Figure 5 .Figure 6 .
Figure 5.The diagram of four different estimators in the first experiment, at different stages, with different k.

Table 2 .
Mean of accuracies for the first test for different methods of three different data sets

Table 3 .
Mean accuracy of the first experiment for different methods in three different dataset