Self-Organizing Map Learning with Momentum

The self-organizing map (SOM) is a type of artificial neural network for cluster analysis. Each neuron in the map competes with the others for the input data objects in order to learn the grouping of the input space. Besides the competition, the neighbors of a winning neuron also learn. SOM has a natural propensity to cluster data into visually distinct clusters, which show the intrinsic grouping of the data. The self-organizing map algorithm is heuristic in nature and will almost always converge. Since the self-organizing map may be trapped in a local optimum, we introduce momentum into the learning process so that the movement of a neuron may jump over a local optimum. We expect this to be similar to the learning of neurons in back-propagation with momentum. Like the learning process in back-propagation, the timing for updating the amount of movement of a neuron is either batch mode or incremental mode. However, due to the neighborhood function, the movement of a non-winner neuron is relatively small compared to when it is a winner. So, when deciding the momentum, the previous movement of a neuron needs special consideration. Experimental results show that adding momentum to the self-organizing map considerably accelerates convergence.


Introduction
Cluster analysis (Richard et al., 1990) groups similar data objects into clusters. Clustering algorithms differ in the data type that they can handle and the result that they produce. Some clustering algorithms handle multidimensional data, and some handle any data format as long as the similarity among data objects can be derived. For example, hierarchical clustering algorithms need only the similarities among the data objects, and density-based clustering algorithms require the distances among data objects. Self-organizing map (SOM) clustering algorithms can process only multidimensional data. K-means clustering produces exactly the specified number of clusters. Clusters produced by density-based clustering algorithms can have arbitrary shapes. Hierarchical clustering algorithms give a dendrogram without specifying the number of clusters. SOM clustering algorithms produce a map of neurons where each neuron is associated with a varying number of data objects.
The self-organizing map (Kohonen, 1982) is a competitive-learning-based clustering neural network. Neurons on the map compete with each other for each input datum. Generally, the self-organizing map (Kohonen, 2001) requires thousands of learning iterations to converge in its search for the optimal solution. Due to the nature of learning and searching, SOM may get stuck in a local optimum. Therefore, we incorporate a momentum term into the learning process of SOM (Masafumi, 1996). The idea appears in some other methods, such as the back-propagation algorithm (Vogl et al., 1988). The momentum term effectively accelerates the learning of the back-propagation algorithm (Hagiwara, 1992). In this paper, we discuss the timing for updating the momentum term in the self-organizing map, which is not mentioned in the work by Masafumi. Analogous to the batch mode and incremental mode for updating the weight vectors in the back-propagation algorithm (Mitchell, 1997), we propose a batch mode algorithm and an incremental mode algorithm for updating the momentum term. The momentum term is the product of the momentum coefficient and the previous updating amount of a neuron. However, in SOM, a non-winner neuron also updates its weight vector. If the non-winner neuron is not close to the winner, the amount of update is very tiny. In order to keep a reasonable amount of momentum, the definition of the momentum term is revised as the product of the momentum coefficient and the momentum mass of a neuron. Momentum mass is defined on the amount of weight vector updating of a neuron. We propose a batch mode algorithm and an incremental mode algorithm to determine the momentum mass. We performed experiments, and the results show that the quantization error decreases for SOM with momentum.

Related Work
Clustering (Kaufman, 1991) partitions a set of data objects into clusters. Objects in a cluster are similar to one another and are dissimilar to objects in other clusters. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. The dissimilarity or similarity metric is expressed in terms of a distance function. The quality of a clustering result depends on the similarity measure used by the method. Usually, for multidimensional data, the Euclidean distance or the Manhattan distance is adopted as the distance function.
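For concreteness, the two distance functions can be sketched as follows (a minimal NumPy illustration; the function names are ours, not from the original paper):

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))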
Major clustering methods can be classified into the following categories. The partitioning methods (Han, 2001) partition the given set of data objects into some number of groups. Then, the medoid of each group is computed. Each data object is re-assigned to the group whose medoid is closest to the data object. The process iterates until there is no new re-assignment. A well-known partitioning method is k-means (Alsabti, 1988). The hierarchical methods (Carpinteiro, 2000) create a dendrogram of the given data objects. A hierarchical clustering method can be either agglomerative or divisive. An agglomerative hierarchical clustering algorithm initially treats each data object as a singleton cluster. Then, the algorithm iteratively merges the two closest clusters into one cluster. There is a need to define the similarity between two clusters. The density-based methods (Ankerst et al., 1999; Ester et al., 1996) are based on connectivity and density functions. A data object is called a core if it has enough neighboring data objects. The two parameters specified by the user are the minimum number of neighbors and the radius. Two data objects are called neighbors if the distance between them is within the radius. The grid-based methods (Wang et al., 1997) quantize the object space into a finite number of cells that form a grid structure. The model-based methods (Dasgupta, 1998) hypothesize a model for each cluster and find the best fit of the data to the given model.

Self-Organizing Map
The self-organizing map is an extensively used method. It is a common type of artificial neural network in which the neurons learn iteratively by competing for the input data and by cooperating with the neighboring neurons. A particular feature of the self-organizing map is that it typically reduces multidimensional data to a low-dimensional grid of nodes, making the result easy to visualize. In the self-organizing map, each map node has a unique coordinate and represents a cluster of input vectors. The self-organizing map algorithm matches a given input vector to its best matching unit (Lippmann, 1987). The best matching unit and its neighbors are adapted using a neighborhood function. The self-organizing map uses the Euclidean distance as the distance between a neuron and an input vector. For an input vector, the neuron having the shortest distance to the input vector among all the neurons is called the winning neuron.
The weight vector of the winning neuron is updated toward the input vector, and its topological neighbors also move toward the input vector in the input space. The weight vectors are updated according to the following equation:

    W_j(t+1) = W_j(t) + α(t) h_{j,j*}(t) [x(t) − W_j(t)]    (1)

where t is the index of the current iteration, the regression is performed for each presentation of a sample of x, denoted x(t), and 0 < α(t) < 1 is the learning rate, which decreases monotonically with the regression iterations. The term h_{j,j*}(t) is the neighborhood function, which is taken as the Gaussian

    h_{j,j*}(t) = exp(−d_{j,j*}² / (2R(t)²))    (2)

where d_{j,j*} is the distance between the map locations j and j*, and R(t) is the neighborhood radius. During the iterations, both α(t) and R(t) decrease exponentially. The decay functions for the learning rate and the neighborhood radius are α(t) = α_0 exp(−t/T) and R(t) = R_0 exp(−t/T), where α_0, R_0, and T are the initial learning rate, the initial radius, and the maximum number of iterations, respectively.
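To make equations (1) and (2) concrete, the following is a minimal NumPy sketch of one learning step of the standard SOM (all names and the array layout are our own; this illustrates the update rule rather than reproducing any particular implementation):

import numpy as np

def som_step(W, coords, x, t, alpha0, R0, T):
    # W      : (M, d) weight vectors, one row per neuron
    # coords : (M, 2) grid coordinates of the neurons
    alpha = alpha0 * np.exp(-t / T)                       # decaying learning rate
    R = R0 * np.exp(-t / T)                               # decaying neighborhood radius
    j_star = np.argmin(np.linalg.norm(W - x, axis=1))     # winning neuron
    d2 = np.sum((coords - coords[j_star]) ** 2, axis=1)   # squared map distances
    h = np.exp(-d2 / (2 * R ** 2))                        # Gaussian neighborhood, eq. (2)
    W += alpha * h[:, None] * (x - W)                     # weight update, eq. (1)
    return W, j_star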
Generally, the learning process of neural networks is very slow; therefore, the self-organizing map requires thousands of iterations to converge.
The quantization error (QE) of a training sample is defined as the distance from the input vector to its best matching neuron in the self-organizing map. Convergence is measured by the sum of the quantization errors, given by the following equation:

    E = Σ_{x∈D} √( Σ_i (x_i − w_{j*(x),i})² )    (3)

where D is the set of input vectors, j*(x) is the best matching neuron of x, and i is the dimensional index of the input vectors.
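A corresponding sketch of equation (3), under the same assumed array layout as above:

import numpy as np

def total_quantization_error(W, data):
    # sum over D of the distance from each input vector
    # to its best matching neuron, as in equation (3)
    return sum(np.min(np.linalg.norm(W - x, axis=1)) for x in data)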
The momentum term effectively accelerates the learning of the back-propagation algorithm (Hagiwara, 1992). In order to effectively accelerate the reduction of the quantization error, a momentum term has been incorporated into the learning process of SOM (Masafumi, 1996). However, the timing of updating the momentum term in the learning process was not thoroughly discussed. Analogous to the batch mode and incremental mode for updating the weight vectors in the back-propagation algorithm (Mitchell, 1997), we propose a batch mode algorithm and an incremental mode algorithm for updating the momentum term.

Learning with Momentum Term
The momentum term is the product of the momentum coefficient and the previous movement of a neuron. The amount of movement of a neuron differs a lot, depending on whether the neuron is the winner or not, and depending on the distance from the neuron to the winner. However, the conventional self-organizing map with a momentum term does not consider whether the previous movement was large or small. The movement of a neuron becomes the previous movement immediately after the input vector is processed. We denote such a conventional self-organizing map with a momentum term as the immediate mode momentum term algorithm. In the following subsections, we propose a batch mode momentum term algorithm and an incremental mode momentum term algorithm.
Let D be the set of data objects. A grid of size N_x × N_y is considered, as in the standard self-organizing map. Let M = N_x × N_y be the number of neurons. Each neuron is associated with a weight vector which has the same dimensionality as the input vectors. The elements of the weight vectors are initialized with values between 0 and 1.
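A minimal sketch of this setup (the variable names are ours; the 15 × 15 grid and the 4-dimensional inputs match the Iris experiment later in the paper):

import numpy as np

rng = np.random.default_rng(0)
Nx, Ny, dim = 15, 15, 4                 # grid size and input dimensionality
M = Nx * Ny                             # number of neurons
W = rng.random((M, dim))                # weight elements initialized in [0, 1)
coords = np.array([(i, j) for i in range(Nx) for j in range(Ny)], dtype=float)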

Momentum Term
The momentum term is used to speed up network training and to avoid local optima. A neuron moves toward the location of an input sample that it wins. When the neuron or its neighbor wins an input sample, it moves along its previous direction as well as toward the location of the new input sample. In other words, the previous moving direction has some effect on the movement of a neuron. The momentum term capturing this effect is determined by a coefficient and the previous amount of movement of the neuron. The coefficient is called the momentum coefficient, which is between 0 and 1. The amount of movement of a neuron can be rewritten as the following equation:

    ΔW_j(t) = α(t) h_{j,j*}(t) [x(t) − W_j(t)] + m(t) ΔW_j(t−1)    (4)

Update the neuron's weight vector:

    W_j(t+1) = W_j(t) + ΔW_j(t)    (5)

where m(t) is the momentum coefficient at iteration t.
Typically, all neurons in a network use the same momentum coefficient. The momentum coefficient usually changes from iteration to iteration; in most cases, it decreases over time. The momentum coefficient decreases with the formula m(t) = m_0 exp(−t/T), where m_0 is the initial value of the momentum coefficient. The initial momentum coefficient affects the learning process, so we use two values, i.e., 0.5 and 0.8, in the experiments to observe the effect.
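Extending the earlier som_step sketch with equations (4) and (5) gives the following illustration (again our own; dW_prev holds the previous movement of every neuron):

import numpy as np

def som_step_momentum(W, dW_prev, coords, x, t, alpha0, R0, T, m0):
    alpha = alpha0 * np.exp(-t / T)
    R = R0 * np.exp(-t / T)
    m = m0 * np.exp(-t / T)                   # decaying momentum coefficient
    j_star = np.argmin(np.linalg.norm(W - x, axis=1))
    d2 = np.sum((coords - coords[j_star]) ** 2, axis=1)
    h = np.exp(-d2 / (2 * R ** 2))
    dW = alpha * h[:, None] * (x - W) + m * dW_prev   # eq. (4)
    W += dW                                           # eq. (5)
    return W, dW, j_star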
SOM with a momentum term was proposed by Masafumi. The algorithm always uses the previous amount of update of a neuron to compute the momentum term, no matter whether the neuron was a winner or not. Also, the momentum coefficient is fixed. The algorithm is as follows:

Initialize the grid size N_x and N_y, the learning rate α, the neighborhood radius R, the momentum coefficient m, and the number of iterations T.
for each iteration t in T do
  for each input data x in D do
    find the winning neuron; let it be j*
    for each neuron j do
      calculate ΔW_j(t) using equation (4)
      update the weights of neuron j using equation (5)
    end
  end
end
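Putting the pieces together, a runnable sketch of the immediate mode training loop, reusing som_step_momentum from above (note that we let the momentum coefficient decay as described earlier, whereas Masafumi's original algorithm keeps it fixed; the parameter values are illustrative only):

import numpy as np

def train_immediate(data, W, coords, alpha0=0.01, R0=5.0, m0=0.5, T=3000):
    dW_prev = np.zeros_like(W)           # previous movement of every neuron
    for t in range(T):
        for x in data:
            # the movement becomes the "previous movement" right away
            W, dW_prev, _ = som_step_momentum(
                W, dW_prev, coords, x, t, alpha0, R0, T, m0)
    return W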
Batch Mode
When considering the momentum for the learning of a neuron, the momentum is the product of the momentum coefficient and the previous movement of the neuron. In the learning of a neuron in back-propagation, the momentum is determined by the momentum coefficient and the previous movement of the neuron. The previous movement is somewhat steady, so there is no special need to consider whether the previous movement is reasonable. However, in SOM, the previous movement of a neuron might be large if the neuron was the winner for the previous input vector, and might be small, or even zero, due to the neighborhood function if the neuron was not the winner. On average, the probability that a neuron wins two consecutive input vectors is 1/n, where n is the number of neurons. In other words, most of the time, if there is no special treatment of this previous movement of a neuron, it will be near zero, which results in no momentum effect.
We denote this previous movement of a neuron used for computing the momentum as the movement mass. So, the momentum is the product of the momentum coefficient and the movement mass.
For discussing the batch mode momentum term, we need the concept of an iteration of learning. An iteration of the learning process is that each input vector in the dataset is processed once. Each neuron accumulates its own movement. The movement of a neuron is accumulated only when it is the winner for the input vector. At the end of an iteration, in other words, after all the input vectors have been processed once, the accumulated movement of each neuron is averaged over the number of input vectors won by the neuron. Then, the averaged movement of each neuron becomes the new movement mass of the neuron. Therefore, the amount of weight update can be rewritten as the following equation:

    ΔW_j(t) = α(t) h_{j,j*}(t) [x(t) − W_j(t)] + m(t) ΔW̄_j(t−1)    (6)

where ΔW̄_j(t−1) is the average of ΔW_j over the input vectors that neuron j won in the (t−1)-th iteration. If a neuron did not win any input vector in the previous iteration, its momentum term is 0.
Its algorithm is as follows:

Initialize the grid size N_x and N_y, the learning rate α, the neighborhood radius R, the momentum coefficient m, and the number of iterations T.
for each iteration t in T do
  for each input data x in D do
    find the winning neuron; let it be j*
    for each neuron j do
      calculate ΔW_j(t) using equation (6)
      update the weights of neuron j using equation (5)
    end
    accumulate ΔW_{j*}(t) for the winner j*
  end
  for each neuron j do
    set the movement mass of j to its accumulated movement averaged over the number of input vectors it won in this iteration (0 if it won none)
  end
end
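A sketch of the batch mode procedure (our own reading of the algorithm above; mass plays the role of the movement mass in equation (6)):

import numpy as np

def train_batch(data, W, coords, alpha0=0.01, R0=5.0, m0=0.5, T=3000):
    M = W.shape[0]
    mass = np.zeros_like(W)                  # movement mass for eq. (6)
    for t in range(T):
        acc = np.zeros_like(W)               # accumulated winner movements
        wins = np.zeros(M)                   # number of inputs each neuron won
        alpha = alpha0 * np.exp(-t / T)
        R = R0 * np.exp(-t / T)
        m = m0 * np.exp(-t / T)
        for x in data:
            j_star = np.argmin(np.linalg.norm(W - x, axis=1))
            d2 = np.sum((coords - coords[j_star]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * R ** 2))
            dW = alpha * h[:, None] * (x - W) + m * mass
            W += dW
            acc[j_star] += dW[j_star]        # accumulate only the winner's movement
            wins[j_star] += 1
        mass = np.zeros_like(W)              # neurons with no win get zero mass
        won = wins > 0
        mass[won] = acc[won] / wins[won][:, None]
    return W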

Incremental Mode
In this incremental mode algorithm, the movement mass of a neuron is updated immediately after the neuron wins an input vector. Therefore, the amount of weight update can be rewritten as the following equation:

    ΔW_j(t) = α(t) h_{j,j*}(t) [x(t) − W_j(t)] + m(t) ΔW̃_j(t−1)    (7)

where j* is the winning neuron and ΔW̃_j(t−1) is the movement mass of neuron j. After the update, the movement mass of the winning neuron is set to its latest movement, i.e., ΔW̃_{j*}(t) = ΔW_{j*}(t).
Its algorithm is as follows:

Initialize the grid size N_x and N_y, the learning rate α, the neighborhood radius R, the momentum coefficient m, and the number of iterations T.
for each iteration t in T do
  for each input data x in D do
    find the winning neuron; let it be j*
    for each neuron j do
      calculate ΔW_j(t) using equation (7)
      update the weights of neuron j using equation (5)
    end
    set the movement mass of j* to ΔW_{j*}(t)
  end
end
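A corresponding sketch of the incremental mode (again illustrative only):

import numpy as np

def train_incremental(data, W, coords, alpha0=0.01, R0=5.0, m0=0.5, T=3000):
    mass = np.zeros_like(W)                  # movement mass for eq. (7)
    for t in range(T):
        alpha = alpha0 * np.exp(-t / T)
        R = R0 * np.exp(-t / T)
        m = m0 * np.exp(-t / T)
        for x in data:
            j_star = np.argmin(np.linalg.norm(W - x, axis=1))
            d2 = np.sum((coords - coords[j_star]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * R ** 2))
            dW = alpha * h[:, None] * (x - W) + m * mass
            W += dW
            mass[j_star] = dW[j_star]        # winner's mass updated immediately
    return W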

Measurement Method
We carried out studies on the impact of the momentum term. Two convergence measures are used. The first measure is the quantization error (QE) of an input vector. The QE is the Euclidean distance between the input vector and its best matching neuron. The sum of the squared QE is minimized for the given training vectors.
The other convergence measure assumes that the number of clusters is known in advance. Let k be the number of clusters in the following discussion. The size of a neuron is defined as the number of input vectors won by the neuron after the final iteration. The method selects the largest k neurons, called seed neurons. Each input vector won by the remaining small neurons is assigned to the closest seed neuron. The input vectors are now partitioned into k clusters, and the centroid of each cluster is computed. The convergence measure is the sum of the squared error (SSE), given by the following equation:

    SSE = Σ_{i=1}^{k} Σ_{x∈C_i} ‖x − m_i‖²    (8)

where k is the number of clusters, C_i is the i-th cluster, m_i is the center of cluster C_i, and x is a data object.
The SSE sums, over all data points, the squared distance of each point from its cluster center. We repeat every experiment 30 times and report the average over the 30 runs.
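A sketch of this seed-neuron evaluation (our own reading of the procedure; we assume winners[n] holds the indices of the input vectors won by neuron n, and we assign each vector won by a small neuron to the seed whose weight vector is closest to it):

import numpy as np

def sse_from_seeds(W, data, winners, k):
    sizes = np.array([len(winners[n]) for n in range(len(W))])
    seeds = list(np.argsort(sizes)[-k:])     # the k largest neurons
    clusters = {s: [data[i] for i in winners[s]] for s in seeds}
    for n, won in enumerate(winners):
        if n in seeds:
            continue
        for i in won:                        # re-assign vectors of small neurons
            s = seeds[np.argmin(np.linalg.norm(W[seeds] - data[i], axis=1))]
            clusters[s].append(data[i])
    sse = 0.0
    for members in clusters.values():
        if members:
            pts = np.array(members)
            centroid = pts.mean(axis=0)      # cluster center m_i in eq. (8)
            sse += np.sum((pts - centroid) ** 2)
    return sse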

Data Sets
In this paper, we use two data sets, Iris and Wine, from the UCI machine learning repository (Newman et al., 1988) to carry out experiments.

Iris
The Iris data set consists of four dimensions with small variance. The dimensions are the petal length, the petal width, the sepal length, and the sepal width. Each object belongs to one of three classes. There are 150 training instances in the data set. Each attribute is normalized with the z-score method. The evaluation of the QE was carried out for a self-organizing map with a 15 × 15 grid (225 neurons) trained for 3000 iterations. The neighborhood function of the self-organizing map was a two-dimensional Gaussian. Figures 1-4 show the QE and SSE plotted as functions of the iteration, for learning rates α = 0.01, α = 0.001, and α = 0.0001, and momentum coefficients m = 0.5 and m = 0.8. In each of the figures, we compare the performance of four methods, i.e., the original self-organizing map algorithm without a momentum term, the self-organizing map with the immediate mode momentum term, the self-organizing map with the batch mode momentum term, and the self-organizing map with the incremental mode momentum term. In the figures, the methods are denoted as w/o momentum, immediate mode, batch mode, and incremental mode.

Wine
The measurements consist of thirteen dimensions. The class attribute has 3 values. There are 178 instances in the data set. Each attribute is normalized with the z-score method. The evaluation of the QE was carried out for a self-organizing map with a 10 × 10 grid (100 neurons) trained for 3000 iterations. The neighborhood function is a Gaussian centered at the winner neuron. The neighborhood size, the learning rate, and the momentum coefficient decrease exponentially during training. The experiment was conducted on the original self-organizing map without a momentum term and on the self-organizing maps with the immediate, batch, and incremental mode momentum terms, each trained with 3000 iterations. After learning, the results are shown in Figures 5 to 8. According to the experimental results shown in Figures 5 and 6, convergence is slow when the learning rate is low, whereas the learning process converges faster when the learning rate is high. At the same learning rate, the convergence with momentum coefficient 0.8 is significantly better than that with momentum coefficient 0.5. With the same momentum coefficient and learning rate, the incremental mode has the lowest quantization error. According to the experimental results shown in Figures 7 and 8, we discuss the relationship between the momentum coefficient and the learning rate. Since Wine has thirteen dimensions and is more complex, panel (a) shows that with learning rate 0.01, adding the momentum term yields a higher SSE value, while Figure 8(c) shows that with learning rate 0.0001, the added momentum term can effectively reduce the SSE.
In the experiment, we compared the original self-organizing map without a momentum term, the self-organizing map with the immediate mode momentum term, the self-organizing map with the batch mode momentum term, and the self-organizing map with the incremental mode momentum term. The results suggest that adding the momentum term to the learning process can accelerate convergence, especially in the incremental mode.

Conclusion
A momentum term has been incorporated into the learning process of SOM. The momentum term of a neuron is the product of the momentum coefficient and the previous movement of the neuron. However, the previous movement of a neuron may be very small if (1) the neuron was not the winner of the previous input vector, and (2) the neuron is far away from the winning neuron of the previous input vector. Thus, the effect of the momentum term in the learning process will be lost. In this paper, we discuss this phenomenon and propose two modes for updating the previous movement of a neuron when computing the momentum term.
For convenience of discussion, we temporarily coined a term, momentum mass, to denote the previous movement of a neuron. Thus, the momentum term of a neuron is the product of the momentum coefficient and the momentum mass of the neuron. The batch mode momentum term algorithm updates the momentum mass of all neurons after an iteration over all input vectors. The new momentum mass of a neuron is the average movement of the neuron over the input vectors it won in the iteration. The incremental mode momentum term algorithm updates the momentum mass of a neuron immediately after it wins an input vector.
Experimental results show that SOM with a momentum term converges faster. The incremental mode momentum algorithm has the most pronounced effect.
Figure 2. Quantization error against learning rate, momentum coefficient = 0.8, Iris data.
Figure 7. Quantization error against learning rate, momentum coefficient = 0.5, Wine data.
Figure 8. Quantization error against learning rate, momentum coefficient = 0.8, Wine data.