Clustering Applied to the Education: A K-means and Hierarchical Application

Most schools in the world now use ICT, so students must use computers and mobile devices both in and out of school. Technology makes students more interested and motivated to learn, and motivation is one of the main engines of learning, since it encourages activity and thought. Motivated students also spend more time working and are therefore more likely to learn. The aim of this paper was to present a clustering of European countries according to the number of desktop computers available to students in primary schools (ISCED 1), lower secondary schools (ISCED 2) and upper secondary schools (ISCED 3). The database on "ICT in Education" developed by the EU Open Data Portal for the year 2019 was used. The classification was carried out with the hierarchical clustering and K-means techniques in the statistical software R 3.6.3. These techniques were chosen because they can group a large number of elements into clusters based on their similarity. This paper concludes that the countries with the highest GDP are not the ones that have the most desktop computers in their schools. Bulgaria is the country with the largest number of desktop computers in its schools.


Introduction
The use of ICT in education allows the generation of new communication channels between teachers and students, encourages collaborative work, and promotes reflective teaching practices, methodological updating and the acquisition of digital skills (Palomino, 2017). Thanks to ICT use, students are permanently predisposed to interact with the computer. The versatility and interactivity offered by ICT, the possibility of interacting virtually with other people and the large volume of scientific information available on the internet are attracting more and more university students worldwide.
strategies, in which the use of ICT is integrated, as has been demonstrated in theoretical models (Fernández-Batanero et al., 2019). The current European education model has led to innovations in the teaching-learning process in higher education, including the incorporation of ICT. Considering the variety of learning styles that students may have, it is important that teachers personalize teaching methods and develop ICT skills (Vega-Hernández, Patino-Alonso, & Galindo-Villardón, 2018). In the European Higher Education Area (EHEA), efforts have been made to promote the use and incorporation of ICT in European universities, emphasizing the motivational and cognitive components underlying the learning process (Valentín et al., 2013). This paper is organized as follows: the first part presents a general contextualization of the most used clustering techniques in data analysis (hierarchical and k-means). The second part describes the methodology used in the classification and the code used in the software R 3.6.3. The final part presents the classification results, the analysis and the discussion. This paper concludes that some countries make greater use of ICT than others and that the uses differ from each other. Bulgaria is the country with the largest number of desktop computers in its schools.

Clustering
The main task of data mining is to extract useful information from large volumes of data. Clustering is one of the most common techniques in data mining; it allows the natural discovery of groups of similar observations. In this case, the hierarchical and k-means techniques were used to group European countries according to the number of desktop computers available in their schools. This grouping can serve to identify the countries that have the fewest computers in schools, so that they may be motivated to define strategies to increase these resources, considering that they are fundamental tools in education in the 21st century.
Clustering is an exploratory data analysis technique used to group objects that have similar characteristics. Its main use is to discover patterns that are difficult to perceive and to generate information that can contribute to decision making (Govender & Sivakumar, 2020). This technique has been used to group similar data sets, observations, vectors, etc. (Jain et al., 1999). In short, it is the process of identifying groups in data (Kaufman & Rousseeuw, 2009). The objects that make up a group are more similar to each other than to the objects that make up other groups (Govender & Sivakumar, 2020). Clustering also facilitates the identification of distributions and patterns of interest, making it easier to understand the underlying structure of the data (Halkidi et al., 2001). This technique was proposed in 1930, but recognition of its usefulness began in the early 1960s. It has been used in many areas of knowledge: natural sciences, health sciences, social and human sciences, etc. (Gong & Richman, 1995).

K-Means Clustering
K-means is an unsupervised learning algorithm used to group a data set into a number (k) of groups defined by the researcher (MacQueen, 1967). It classifies objects into clusters so that objects within the same cluster are as similar as possible, while objects in different clusters are as different as possible. In k-means clustering, each cluster is represented by a centroid that corresponds to the mean of the points assigned to the cluster (Kassambara, 2017). In summary, this clustering technique groups objects in a way that minimizes the variation within each cluster (Jaiswal, 2018). Among the existing k-means algorithms, the most used is the one that defines the within-cluster variation as the sum of the squared Euclidean distances between each element of the cluster and the centroid (Hartigan & Wong, 1979):

$$W(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \qquad (1)$$

where $x_i$ is a data point belonging to the cluster $C_k$ and $\mu_k$ is the mean of the data points assigned to the cluster $C_k$. Each data point $x_i$ is assigned to a cluster so that the sum of the squared distances from the data points to their assigned cluster center $\mu_k$ is minimal (Jaiswal, 2018). Briefly, the steps of this algorithm are:

1. Determine the number of clusters (k) to be created.
2. Randomly select k objects from the initial data set as the initial cluster centers (centroids).
3. Assign each remaining data point to its nearest centroid, based on Euclidean distance.
4. For each cluster, recalculate the centroid as the mean of all the data points in the cluster.
5. Iteratively minimize the total within-cluster sum of squares: repeat steps 3 and 4 until the centroids no longer change or the maximum number of iterations is reached.

The total within-cluster variation, or total within-cluster sum of squares, is defined by the following equation:

$$W_{\text{total}} = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \qquad (2)$$

This is the sum, over all clusters, of the squared Euclidean distances between the data points and the corresponding centroid (Kassambara, 2017).
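As a rough illustration of equations (1) and (2), the following sketch recomputes both quantities by hand and compares them with the values returned by base R's `kmeans()`. The data are synthetic and the object names (`x`, `w_ck`, `w_total`) are assumptions made for this example only; they are not part of the original analysis.

```r
# Toy data: two well-separated groups of 2-D points (illustrative only).
set.seed(123)
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2)
)

km <- kmeans(x, centers = 2, nstart = 25)

# Equation (1): within-cluster sum of squares W(C_k) for each cluster.
w_ck <- sapply(seq_len(nrow(km$centers)), function(k) {
  pts <- x[km$cluster == k, , drop = FALSE]
  # Subtract the centroid from every point, then sum the squared differences.
  sum(sweep(pts, 2, km$centers[k, ])^2)
})

# Equation (2): total within-cluster sum of squares.
w_total <- sum(w_ck)

# Both hand-computed values should agree with the kmeans() result.
print(all.equal(w_ck, km$withinss, check.attributes = FALSE))
print(all.equal(w_total, km$tot.withinss))
```

The `withinss` and `tot.withinss` components of the `kmeans()` result hold the per-cluster and total sums of squares, so the comparison is a quick way to check one's reading of the two equations.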

Hierarchical Clustering
This clustering technique is more informative than the previous one, as it generates a dendrogram (partition tree) (Xu et al., 2020). By cutting the dendrogram generated by a hierarchical clustering algorithm at different heights, different clusterings can be obtained without running the algorithm again (Xu et al., 2020). Hierarchical clustering (HC) has a time complexity of $O(n^2)$, can make use of reciprocal nearest neighbors and gives reproducible results. HC starts from n clusters and successively merges the most similar ones to form a larger one. This process is repeated until the number of clusters equals the desired number. Unlike the previous technique, hierarchical clustering does not require predefined parameters. Because of this, it is considered the most suitable clustering technique for handling real data, where setting parameters is a complex task (Xu et al., 2020; Bouguettaya et al., 2015).
The key principle in this grouping technique is to repeatedly combine the two closest groups into one larger group (Pathak, 2018). The main steps of this algorithm are:

1. Find the distance between each pair of points in the data set and store it in a distance matrix.
2. Place each point in its own cluster.
3. Merge the closest pair of clusters based on the distances in the matrix. The number of clusters is reduced by one.
4. Recalculate the distance between the new cluster and the old ones and store them in a new distance matrix.
5. Repeat steps 3 and 4 until all the clusters are merged into one.
Linkage methods measure the distance between clusters to decide the merging rule. Among the different methods available are complete (uses the maximum distance between clusters before merging), single (uses the minimum distance between clusters before merging), average (uses the average distance between clusters before merging) and centroid (finds the centroid of cluster 1 and the centroid of cluster 2, and then calculates the distance between them before merging) (Pathak, 2018). Table 1 presents some commonly used metrics for this clustering technique:
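The steps and linkage rules described above can be sketched with base R's `dist()`, `hclust()` and `cutree()`. The six one-dimensional points below are invented for illustration, and the names `x`, `trees` and `memberships` are assumptions of this example rather than objects from the study.

```r
# Toy data: six points on a line forming three obvious pairs (illustrative only).
x <- matrix(c(0, 0.5, 5, 5.4, 10, 10.7), ncol = 1,
            dimnames = list(paste0("p", 1:6), "value"))

# Step 1: pairwise Euclidean distance matrix.
d <- dist(x, method = "euclidean")

# Steps 2-5: agglomerative merging under four linkage rules.
# The hclust() documentation recommends squared distances for centroid linkage.
trees <- list(
  complete = hclust(d,   method = "complete"),
  single   = hclust(d,   method = "single"),
  average  = hclust(d,   method = "average"),
  centroid = hclust(d^2, method = "centroid")
)

# Cut each dendrogram into 3 clusters and compare the memberships.
memberships <- sapply(trees, cutree, k = 3)
print(memberships)
```

On data this cleanly separated, all four linkage rules recover the same three pairs; on real data the choice of linkage can change the resulting groups, which is why it should be reported together with the distance metric.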

Methodology
The analysis used the results of the survey conducted by IPSOS and Deloitte on the use of ICT in education in 2019. This survey aimed to identify information on use, access and attitudes towards ICT use of students, parents and teachers in 31 European countries. The survey was conducted in primary schools (ISCED 1), lower secondary schools (ISCED 2) and upper secondary schools (ISCED 3). The survey contained different themes: access and use of ICT, digital activities carried out by students and teachers, teaching skills and competences, use of ICT at home, digital policies and strategies. The survey was applied to a representative sample of 8,749 educational institutions. Table 2 presents a summary of the variables:

Data and Discussion
For the purposes of this paper, only information related to the number of desktop computers in laboratories, classrooms, school libraries and other locations that were accessible in the schools was used:

Application of K Means Clustering
This technique subdivides the data set into k groups, where k is the number of groups specified in advance (Kassambara, 2017). The following two R libraries (R 3.6.3) were used:

    library(cluster)
    library(factoextra)

To determine the optimal number of clusters, the following function was used:

    fviz_nbclust(europe, kmeans, method = "gap_stat")

The graph shows that the suggested number of clusters is 4. To determine the countries in each cluster, the following functions were used:

    set.seed(123)  # for reproducibility
    km.res <- kmeans(europe, 4, nstart = 25)
    # Visualize
    fviz_cluster(km.res, data = europe, palette = "jco", ggtheme = theme_minimal())

Figure 1 presents the cluster plot: Cluster 1 (Bulgaria), Cluster 2 (Croatia, Greece, Germany, Iceland, Belgium, Romania, Denmark and Spain), Cluster 3 (Slovakia, Portugal, Poland, Austria, Finland, Malta, Italy, Cyprus and Slovenia) and Cluster 4 (Estonia, UK, Norway, Ireland, France, Luxembourg, Sweden, Czech Republic, Hungary, Lithuania, Latvia, Netherlands and Turkey). Figure 3 presents the European political map:

The K-means clustering results show that Bulgaria is the only country that was classified in a cluster of its own. This country has the largest number of desktop computers in laboratories, classrooms, school libraries and other accessible locations in its schools, which suggests substantial investment in educational technology. The second, third and fourth clusters group countries that are not geographically close but have a similar number of computers in their schools (laboratories, classrooms, school libraries and other locations). The four European countries with the largest Gross Domestic Product (GDP) in 2019 were Germany, the UK, France and Italy (Statistics Times, 2020); of these, only the UK and France were classified in the same cluster. All clusters have a different number of countries.
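For readers without factoextra, the gap-statistic selection of k can be approximated with `clusGap()` and `maxSE()` from the cluster package. This is only a sketch under stated assumptions: the matrix `toy` is synthetic and stands in for the country table, and the "firstSEmax" selection rule and `B = 50` are choices made for this example, not settings reported in the paper.

```r
library(cluster)  # recommended package shipped with standard R installations

# Synthetic stand-in for the country-by-variable table (NOT the survey data).
set.seed(123)
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

# Gap statistic for k = 1..6, using B reference data sets per k.
# Extra arguments (here nstart) are passed on to kmeans().
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# Choose k from the gap values and their simulation standard errors.
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(k_hat)

# Fit k-means with the suggested k and inspect the cluster sizes.
km <- kmeans(toy, centers = k_hat, nstart = 25)
print(table(km$cluster))
```

Because the gap statistic relies on simulated reference data, the suggested k can vary with the seed and with `B`, so it is worth rerunning the selection with a few settings before fixing the number of clusters.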

Application of Hierarchical Clustering
This technique does not require the number of clusters to be specified in advance, and its result is a dendrogram (Kassambara, 2017). The dendrogram is the graphical representation of the clustering. Usually, it is drawn backward, starting from the final cluster with all the objects and from similarity 0 (Forina et al., 2002). The functions hclust(), dist() and fviz_dend() were used:

    res.hc <- hclust(dist(europe), method = "ward.D2")
    fviz_dend(res.hc, cex = 0.5, k = 4, palette = "jco")

Figure 4 presents the cluster dendrogram: Cluster 1 (Sweden, France, Hungary, Luxembourg, Ireland, Netherlands, Turkey, Czech Republic and Latvia), Cluster 2 (Norway, Estonia, UK, Cyprus, Italy, Malta, Poland, Slovakia, Austria and Finland), Cluster 3 (Bulgaria) and Cluster 4 (Slovenia, Spain, Portugal, Croatia, Iceland, Lithuania, Denmark, Romania, Belgium, Germany and Greece). As with the K-means clustering, the hierarchical clustering results show that Bulgaria is the only country classified in a cluster of its own, and all clusters have a different number of countries. In this classification, two of the four European countries with the largest GDP (the UK and Italy) were classified in the same cluster. Countries belonging to the same cluster have a similar number of desktop computers in their schools. Additionally, a heat map was built. This is another way to visualize hierarchical clustering and consists of a graphical representation of data that uses a system of color-coding to represent different values (Lommatsch, Tucker, Moyer-Packenham, & Symanzik, 2018). The columns and rows of the data matrix are reordered according to the hierarchical clustering result, putting similar observations close to each other (DataNovia, 2018).
The R package pheatmap was used:

    library(pheatmap)
    pheatmap(t(europe), cutree_cols = 4)

In each category (classrooms, laboratories, libraries and other locations), the blue, yellow, orange and red colors represent quantities. Dark blue represents an amount close to 10,000,000 desktop computers and dark red an amount close to 60,000,000. This result shows that only one country (Bulgaria) has around 40,000,000 desktop computers in other locations. In the majority of the countries analyzed, the number of desktop computers in classrooms, laboratories and libraries is shown in blue, i.e. the number in these places does not exceed 30,000,000 desktop computers.
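The agreement between a k-means partition and a hierarchical partition, which the two applications compare informally, can also be checked with a cross-tabulation. The sketch below uses base R only; the matrix `toy`, its row and column names and the object `agreement` are hypothetical and do not reproduce the survey data.

```r
set.seed(123)

# Synthetic stand-in (NOT the survey data): 12 hypothetical countries, 4 variables.
toy <- matrix(rnorm(48), ncol = 4,
              dimnames = list(paste0("country", 1:12),
                              c("classrooms", "laboratories", "libraries", "other")))

# Partition the same rows with both techniques, asking for 4 groups each.
km_groups <- kmeans(toy, centers = 4, nstart = 25)$cluster
hc_groups <- cutree(hclust(dist(toy), method = "ward.D2"), k = 4)

# Rows: k-means clusters; columns: hierarchical clusters.
# A table with one non-zero cell per row means the two partitions coincide.
agreement <- table(kmeans = km_groups, hierarchical = hc_groups)
print(agreement)
```

Cluster labels are arbitrary in both methods, so the comparison should look at which rows and columns share observations rather than at whether the label numbers match.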

Conclusions
The aim of this paper was to present an application of clustering techniques, in particular hierarchical and k-means clustering, to ICT use in European schools. Bulgaria is the country with the largest number of desktop computers in its schools. The countries with the highest GDP are not the ones that have the most desktop computers in their schools, which may be due to the use of laptops. Overall, this paper intends to provide researchers with a brief guide to applying cluster analysis methods (hierarchical and k-means) to a data set. The database contained sufficient information, and the application of clustering techniques made it possible to obtain valid information. Future research may use clustering techniques to analyze other aspects of ICT use in European schools.
Bulgarian schools provide conditions that enable a technology-enhanced teaching process suited to the needs of today's digital society. These include modern ICT infrastructure and resources, as well as opportunities for teachers to improve their ICT competence (Terzieva et al., 2019). The main reason for the low number of desktop computers in schools in some European countries is that their use is concentrated mainly on the sporadic retrieval of information from the Internet. Only a small number of teachers use computer-assisted teaching and learning materials regularly, as they report difficulties in integrating them into classroom practice, problems in allocating time for training and low ICT knowledge and skills. Furthermore, they lack the structural support as well as the infrastructure to find the most effective ways to apply ICT in the teaching process (Welzel et al., 2010).