A Class Validation Proposal of a Pedagogic Domain Ontology based on Clustering Analysis

The knowledge bases of the Web are fundamentally organized as ontologies in order to answer queries based on semantics. The ontology learning process comprises three fundamental steps: creation of classes and relationships, population, and evaluation. This paper focuses on class creation and introduces a class validation proposal based on clustering analysis. A pedagogical domain was selected as the case study, for which a corpus was semi-automatically built from articles written in Spanish and published in the Social Sciences. Moreover, a dictionary containing classes, concepts and synonyms was included in the experiments. Clustering analysis made it possible to verify the concepts that the experts considered the most important for the domain. For the selected case study, the cluster analysis step reports clusters with the same instances as the clusters defined by the experts.


Introduction
Nowadays, the available information has increased exponentially; therefore, it is necessary to propose novel techniques for processing that information and using it for different scientific objectives. Ontologies can be used for purposes such as structuring knowledge in taxonomies, vocabulary management, natural language processing applications, searches, recommendation systems, and e-learning, among others (El-Ansari et al., 2016).
The ontology learning process integrates class detection, creation, population and evaluation. This process has been applied in different studies such as Fu et al. (2008), García et al. (2010) and Ochoa et al. (2011). The ontology learning process needs resources compatible with the research. This paper focuses on the initial two steps of the ontology learning process: corpus analysis and detection of the principal classes. A dictionary containing the principal concepts and a corpus formed from pedagogical papers in Spanish were built in the process. The pedagogical domain is extensive, thus the research focuses on the creation of tools that support instructors' classes in the classroom. Three topics that, in the opinion of experts, have been widely researched were considered as the main classes of the ontology: learning styles, intelligence types and learning strategies.
The article is organized in eight sections, described as follows. Section 2 introduces the ontology learning problem and the ontology formalization process. Section 3 describes the topics addressed in the corpus, and Section 4 discusses the clustering methods and the evaluation metrics used in the experiments. Section 5 presents related work, mainly on the pedagogical domain but also on some other domains; work on the ontology learning process is also discussed. Section 6 presents the proposed methodology and Section 7 shows the analysis of results. Finally, Section 8 presents conclusions and future work.

Ontology Learning
Ontology is a philosophical discipline which can be described as the science of existence or the study of being. In modern computer science parlance, an ontology is a formal and explicit specification of a shared conceptualization of a domain of interest. Its classes, relationships, constraints and axioms define a common vocabulary to share knowledge (Guarino et al., 1999). An ontology resembles a database and is built from an initial corpus from which the principal components or keywords are extracted. Afterward, the relationships between keywords are inferred and a graph structure is created (the keywords are the nodes and the relationships are the edges).
Ontologies can be used for purposes such as structuring knowledge in taxonomies, vocabulary management, natural language processing applications (El-Ansari et al., 2016), searches (Celjuska & Vargas-Vera, 2004), recommendation systems (Dai & Li, 2010), and e-learning. Ontologies can model interaction systems between users and their environment, owing to their ability to manage complex knowledge in reusable formal representations. Formally, an ontology can be defined as the tuple (Faria et al., 2014):

$$O = (C, H, I, R, P, A)$$

where:
C is the set of entities of the ontology
H is the set of taxonomic relationships between concepts
I is the set of instance relationships related to C
R is the set of non-taxonomic ontology relationships
P is the set of properties of ontology entities
A is the set of axioms, rules that allow checking the consistency of an ontology and inferring new knowledge through some inference mechanism

The process called ontology learning is carried out to generate knowledge and includes the following tasks (Cimiano, 2006):
1. Acquisition of relevant terminology
2. Identification of synonym terms
3. Formation of concepts
4. Hierarchical organization of concepts (concept hierarchy)
5. Learning relations, properties or attributes, along with the appropriate domain and range
6. Hierarchical organization of relations (relation hierarchy)
7. Instantiation of axiom schemata
8. Definition of arbitrary axioms

Ontology population is usually accomplished in three stages: identification of candidate instances, classifier construction and instance classification. The input is an ontology and a corpus in which the candidate instances are identified. Using a classifier, the instances are labeled with a class and, finally, the output is the populated ontology (Faria et al., 2014).
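As an illustration of the formal definition, the six components can be held in a small data structure. The following is a minimal sketch, not the representation used in this work: the field types, the `is_consistent` method and the example axiom are assumptions introduced only to make the roles of C, H, I, R, P and A concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical container mirroring the components C, H, I, R, P and A."""
    classes: set = field(default_factory=set)      # C: entities (classes)
    taxonomy: set = field(default_factory=set)     # H: (subclass, superclass) pairs
    instances: set = field(default_factory=set)    # I: (instance, class) pairs
    relations: set = field(default_factory=set)    # R: (class, name, class) triples
    properties: set = field(default_factory=set)   # P: (class, property) pairs
    axioms: list = field(default_factory=list)     # A: rules, here plain callables

    def is_consistent(self):
        # The ontology is taken as consistent when every axiom holds.
        return all(axiom(self) for axiom in self.axioms)


def instances_are_typed(onto):
    # Illustrative axiom: every instance must point at a declared class.
    return all(cls in onto.classes for _, cls in onto.instances)


onto = Ontology(
    classes={"LearningStyle", "IntelligenceType", "LearningStrategy"},
    instances={("Active", "LearningStyle"), ("Musical", "IntelligenceType")},
    axioms=[instances_are_typed],
)
```

Population then amounts to adding (instance, class) pairs to `instances` while the axioms continue to hold.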
The main problem faced in the ontology learning process is determining which ontologies best suit a particular problem; hence, the selection of an evaluation technique is mandatory. An interesting aspect of evaluation, as opposed to information retrieval and other areas, is that ontologies are not an end product but a means of achieving some other task. In this sense, an evaluation approach is also useful to help users choose the ontology that best fits their requirements when faced with a multitude of options. In the ontology learning process, we cannot simply measure how well a system constructs an ontology without raising more questions (Wong et al., 2011). Instead, evaluation begins with questions such as: is the ontology good enough? If so, with respect to what application? An ontology is made up of different layers, such as terms, concepts, and relations.

Pedagogical Domain
Since the pedagogical domain is extensive, the aim of this work is limited to the creation of support tools for instructors in the classroom. Three topics were researched: learning styles, intelligence types and learning strategies in class.

Learning Styles
Learning styles describe the way in which a person learns. However, there are alternative views about how humans learn concepts and process information, and several theories describing the different types of learning have been proposed. This work adopts as a reference the David Kolb model (Kolb, 1976), where a learning style is determined using the Learning Style Inventory (LSI) scale. The theory proposes a method for describing how students solve problems and apply new knowledge from personal experience within their learning environment. It considers the psychological processes of perception and processing (Olivos et al., 2016). This method comprises four learning styles, as follows:

Active: People who get involved in new experiences; they tend to act first and think about the consequences afterwards.

Reflective: People who are observers and analyze their experiences from different perspectives. They collect and analyze data in detail before reaching a conclusion.

Theoretical: People who adapt and integrate their observations into complex and logically well-founded theories. They prioritize logic and rationality before analysis and synthesis.

Pragmatic: People who test their ideas, theories and new techniques, and try to see if they work in practice. They dislike long discussions on the same subject. They are practical and attached to reality.

Intelligences Types
Intelligence is the ability to solve problems, or to create products, that are valued within one or more cultural settings (Gardner, 2001). Humans have capacities and potentials that can be employed in productive ways (together or separately). This idea gave rise to the theory of multiple intelligences. The types of intelligence identified in Gardner (2001) are described below:

Linguistic intelligence involves sensitivity to spoken and written language, the ability to learn languages, and the capacity to use language to accomplish certain goals.

Logical-mathematical intelligence consists of the capacity to analyze problems logically, carry out mathematical operations, and investigate issues scientifically.

Musical intelligence involves skill in the performance, composition, and appreciation of musical patterns. It encompasses the capacity to recognize and compose musical pitches, tones, and rhythms.

Bodily-kinesthetic intelligence entails the potential of using the whole body or parts of the body to solve problems. It is the ability to use mental abilities to coordinate bodily movements.

Spatial intelligence involves the potential to recognize and use the patterns of wide space and more confined areas.

Interpersonal intelligence is concerned with the capacity to understand the intentions, motivations and desires of other people. It allows people to work effectively with others.

Intrapersonal intelligence entails the capacity to understand oneself and to appreciate one's feelings, fears and motivations.

Learning Strategies
A learning strategy is a set of procedures that a learner uses in a conscious, controlled and intentional way as flexible tools to learn and solve problems (Barriga & Hernández, 2004). Figure 1 shows the types of strategies published by González and Valle (2011).
Figure 1. Types of learning strategies

Cluster Methods
Clustering is the process of grouping a set of data objects into multiple groups so that objects within a cluster have high similarity but are dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the attribute values of the objects and on distance measures (Han, 2005). Table 1 shows the main characteristics of each clustering algorithm used in this work; the algorithms are explained as follows:


The KMeans algorithm clusters the input data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum of squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields (Arthur & Vassilvitskii, 2007).


The Agglomerative Clustering object performs hierarchical clustering using a bottom-up approach: each instance starts in its own cluster, and clusters are successively merged (Pedregosa et al., 2011).


The Birch method builds a tree called the Characteristic Feature Tree (CFT) for the given data. The data is essentially lossy compressed to a set of Characteristic Feature nodes (CF Nodes). The CF Nodes have a number of subclusters called Characteristic Feature subclusters (CF Subclusters), and the CF Subclusters located in the non-terminal CF Nodes can have CF Nodes as children (Zhang et al., 1996).

Spectral Clustering builds an affinity matrix between samples and embeds it in a low-dimensional space, where KMeans is then applied. Spectral Clustering requires the number of clusters to be specified. It works well for a small number of clusters but is not advised when using many clusters (Luxburg, 2007).
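To make the inertia criterion concrete, the following is a minimal sketch of the basic k-means iteration in plain Python. It is not the Scikit-learn implementation used in the experiments: the function names are hypothetical, the initial centroids are passed in explicitly rather than chosen by an initialization heuristic, and the four two-dimensional points are toy data.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, initial_centroids, iterations=10):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                # Move the centroid to the mean of the points assigned to it.
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    labels = assign(points, centroids)
    # Inertia: within-cluster sum of squared distances to the centroids.
    inertia = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, centroids, inertia = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
```

The number of clusters is fixed by the number of initial centroids, which corresponds to the input parameter that all four methods share.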

Evaluation Metrics
Clustering evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The tasks include assessing clustering tendency, determining the number of clusters, and measuring clustering quality (Han, 2005). The metrics used are described in the next paragraphs (Pedregosa et al., 2011).
Mutual Information. It is a function that measures the agreement of two assignments, ignoring permutations of the labels (true labels and predicted labels). The range of results is from 0 to 1, where 1 is the perfect match. The value of the mutual information is not adjusted for chance and will tend to increase as the number of different clusters increases, regardless of the actual amount of mutual information between the label assignments. The general formula is given in Equation 1:

$$MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i)\,P'(j)} \quad (1)$$

where U and V are the labels, actual and predicted by the clustering algorithm, respectively; P(i) and P'(j) are the probabilities that an instance falls in class i of U or in cluster j of V, and P(i, j) is the probability that it falls in both. This concept uses the entropy to perform the calculation of the metric.
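The mutual information between two labelings can be computed directly from label counts. The sketch below is a plain-Python illustration, not the Scikit-learn routine; the normalization by the arithmetic mean of the two entropies is an assumption (other normalizations exist), included only to show how a score between 0 and 1 can be obtained.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_true, labels_pred):
    n = len(labels_true)
    count_u = Counter(labels_true)
    count_v = Counter(labels_pred)
    count_uv = Counter(zip(labels_true, labels_pred))
    mi = 0.0
    for (u, v), n_uv in count_uv.items():
        # P(i, j) * log(P(i, j) / (P(i) * P'(j))), written with raw counts.
        mi += (n_uv / n) * log(n * n_uv / (count_u[u] * count_v[v]))
    return mi


def normalized_mutual_information(labels_true, labels_pred):
    # Assumption: divide by the arithmetic mean of the two entropies.
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    if denom == 0:
        return 1.0  # both labelings put everything in a single group
    return mutual_information(labels_true, labels_pred) / denom


labels_true = [0, 0, 1, 1]
permuted = [1, 1, 0, 0]   # same grouping with the label names swapped
unrelated = [0, 1, 0, 1]  # grouping independent of the true one
```

With these toy labelings, the permuted assignment reaches the maximum score and the unrelated one has zero mutual information, which illustrates that label names do not matter.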
Rand Index. It is a function that measures the similarity between the true labels and those obtained by the clustering algorithm; it is symmetric, so the order in which the two labelings are given does not affect the result. In its chance-adjusted form, the results range from -1 to 1, where negative values are considered bad and 1 is perfect similarity. If C is the ground-truth class assignment and K is the result of applying a clustering algorithm, a and b are defined as:

a: The number of pairs of elements that are in the same set in C and in the same set in K.

b: The number of pairs of elements that are in different sets in C and in different sets in K.

The Rand index is given by Equation 2:

$$RI = \frac{a + b}{\binom{n}{2}} \quad (2)$$

where the denominator is the total number of possible pairs among the n elements of the data set.
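The pair counts a and b can be obtained by enumerating all pairs of instances. The following plain-Python sketch computes the unadjusted index from that definition; the function name and the toy labelings are illustrative, and no chance adjustment is applied.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    a = b = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1  # together in C and together in K
        elif not same_true and not same_pred:
            b += 1  # separated in C and separated in K
    return (a + b) / len(pairs)


perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

In the second call, one of the six pairs is kept together by both labelings and four are separated by both, so the index is 5/6.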
Homogeneity. A clustering is homogeneous when each group contains only members of a single class. This concept is complemented by completeness (exhaustiveness), in which all members of a given class are assigned to the same group. Homogeneity values lie between 0 and 1, and the formula is given by Equation 3:

$$h = 1 - \frac{H(C \mid K)}{H(C)} \quad (3)$$

where H(C | K) is the conditional entropy of the classes given the cluster assignments and H(C) is the entropy of the classes.
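Homogeneity can be computed from the same label counts used for the entropies. The sketch below is a plain-Python illustration with hypothetical names; returning 1.0 when the class entropy is zero is an assumed convention for the degenerate case of a single class.

```python
from collections import Counter
from math import log


def homogeneity(labels_true, labels_pred):
    n = len(labels_true)
    n_c = Counter(labels_true)
    n_k = Counter(labels_pred)
    n_ck = Counter(zip(labels_true, labels_pred))
    # Entropy of the classes, H(C).
    h_c = -sum(c / n * log(c / n) for c in n_c.values())
    if h_c == 0:
        return 1.0  # assumed convention: a single class is trivially homogeneous
    # Conditional entropy of the classes given the clusters, H(C | K).
    h_c_given_k = -sum(m / n * log(m / n_k[k]) for (_, k), m in n_ck.items())
    return 1.0 - h_c_given_k / h_c


pure = homogeneity([0, 0, 1, 1], [0, 1, 2, 3])     # every cluster holds one class
merged = homogeneity([0, 0, 1, 1], [0, 0, 0, 0])   # one cluster mixes both classes
```

Splitting a class across several clusters does not lower homogeneity, which is why it is read together with completeness.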
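Fowlkes. This score (the Fowlkes-Mallows index) is defined as the geometric mean of precision and recall, as given by Equation 4:

$$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}} \quad (4)$$

where TP, FP, and FN are the numbers of true positives, false positives and false negatives, respectively.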
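For clustering, the counts in the Fowlkes score are taken over pairs of instances. The following plain-Python sketch assumes that a true positive is a pair grouped together in both labelings, a false positive is a pair grouped together only by the clustering, and a false negative is a pair grouped together only in the true labels; since the formula is symmetric in FP and FN, swapping these two roles does not change the score.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1  # together in the clustering only
        elif same_true:
            fn += 1  # together in the true labels only
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


perfect = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
partial = fowlkes_mallows([0, 0, 1, 1], [0, 0, 1, 2])
```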

Related Work
In the next subsections, some works on the pedagogical domain are presented. Furthermore, some works focused on corpus creation, class extraction and population applied in other domains are addressed.

Pedagogical Domain
In Wu (2008), an ontology-based internet education system that implements the sharing and reuse of learning material among systems is developed. It is a qualitative study in which an example based on a basic online computing course is created to describe the system modules: learning, interface and resources.
In Zhu et al. (2008), ENGOnto is presented; the ontology integrates multiple relevant ontologies so that personalized agents can deal with dynamic changes in the students' learning process and with the interaction between instructor and learning resources in the environment of English language education. The ontology was built manually, but the authors describe the process of generating knowledge point dependencies (class integration). In Zhu and Yao (2009), a learning activity sequencing approach based on ontologies and the learner activity graph in collaborative environments is described. The system is based on an ontology, and its key technologies include ontology-based knowledge representation and ontology-based knowledge retrieval.
A domain ontology structure that plays an important role in representing higher education concepts and in assisting specialized e-learning systems is addressed in Bucos et al. (2010). As the number of classes in the ontology structure and the properties associated with them increased, the ontology was divided into a set of smaller ontologies. Each of these ontologies contains elements that are closely related, linked to each other by the properties that bind them together.
An ontology created from CASE diagrams for online education is presented in Bagiampou and Kameas (2012); its evaluation is carried out manually by experts. In that work, the focus is on the construction step, where the classes are extracted manually. The process of creating an ontology from information about courses offered at advanced levels is explained in Ameen et al. (2012), where students can choose courses according to their academic background. Both works present the structure, information, and hierarchy of the classes in a manual way. Other researchers focus on online education, such as Dai and Li (2010), Du et al. (2012) and, more recently, Hssina et al. (2017), where ontologies are manually defined from XML resources available on the Internet, and the evaluation is a manual process too. On the other hand, Hu et al. (2016) propose an ontology for the internet learning process. In these works, an ontology is defined for each entity in the learning process, and the evaluation is conducted through a manual process supervised by domain experts.
There are works such as Uskov et al. (2016) focused on automatic learning; in that paper, an ontology based on the Internet of Things used in a classroom is created, considering the students' intelligences. Méndez et al. (2015) propose the use of ontological modeling for learning personalization that involves student profiles according to Howard Gardner's theory of multiple intelligences, as well as the use of a domain ontology that helps to represent knowledge in virtual learning platforms.
These studies address one or more steps of the ontology learning process, each in a different pedagogical domain. In Alemán et al. (2017), several studies focused on steps of the ontology learning process were discussed.

Corpus Creation
Some studies focus on corpus creation for particular domains. Grljevic and Bosnjak (2015) discuss the creation of a relevant linguistic corpus written in Serbian. The focus is on the sentiment analysis of student-generated content in higher education. In Teixeira et al. (2011), the problem of creating a reference corpus for news classification in fine-grained multi-label scenarios was analyzed. The authors propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods: Support Vector Machines, Nearest Neighbor Classifiers and a third method based on a dictionary.

Class Extraction
Ontologies for the Use of digital learning Resources and semantic Annotations on Line (OURAL) is presented in Grandbastien et al. (2007); the project includes people from several disciplines (educational science, computer science, and cognitive psychology) building e-learning services. The authors present the classes extracted using natural language processing (NLP) techniques from unstructured texts about learning situations. The educational domain was also analyzed in Fu et al. (2008), but applied to the Chinese language.
Other works, such as Ochoa Hernández (2011), present methods for semi-automatic class extraction using a database of Spanish verbs, diathesis alternations and syntactic-semantic schemes (the ADESSE tool) (García et al., 2010), where the extracted semantic patterns are the classes. This methodology was applied in the educational domain and replicated in the financial domain in Ochoa et al. (2011); in both works, the class extraction was completed with the opinion of domain experts. A method for class extraction using linguistic patterns and NLP techniques such as morphological labeling is presented in more recent research (Kang et al., 2014).
An ontology-based data mining approach to classify web documents, in order to facilitate applications based on classified text documents such as search engines, is proposed in Hajiabadi (2014). The ontology is generated by mining Wikipedia. Because of the collaborative efforts of many users in adding new articles, Wikipedia expands enormously and consequently covers almost all fields and subfields. The ontology was named WikiOnt, and it contains all the categories and subcategories that exist in Wikipedia.
A theoretical approach is presented in Wang (2010), which describes the two concepts of Semantic Web and Ontology and points out the core role of Ontology in the Semantic Web. Moreover, a Double-Channel Helix Methodology is described.

Methodology
Figure 2 shows the global process for ontology creation (top) and the steps of the methodology proposed in this paper for the class detection process (bottom). A corpus was built from academic papers that share two principal characteristics: they focus on social sciences (pedagogy) and are written in Spanish. In addition, the papers are related to the principal classes to be extracted, and they were joined into an initial corpus. In the last step, an analysis using clustering methods was carried out. Clustering analysis made it possible to verify that the concepts the experts considered the most important for the domain coincided with the clusters. The Scikit-learn toolkit was used for the experiments, and its online documentation for the analysis of results (Pedregosa et al., 2011).

Results
The papers from the input set were extracted and preprocessed into plain text. The result of this process is a corpus A with 51 instances, where each instance is a paper. This corpus can be described as A = {K, T, C}, where:

K is the paper key, a numeric attribute in {1…51}.

T is the whole paper text, including the title and abstract. In the texts, stopwords (Note 1), numbers and words shorter than 2 letters were deleted.

C is the instance class, a nominal attribute assigned according to the principal topic of the paper. C = {LearningStyle, IntelligenceType, LearningStrategy}. Each paper was manually labeled by domain experts according to its title and conclusions; the corpus is balanced, with 17 instances for each class.
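The structure of one corpus instance and the preprocessing applied to its text can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the corpus: the `Paper` record, the `preprocess` function and the tiny stopword set are hypothetical, and the actual stopword list (Note 1) is not reproduced here.

```python
import re
from dataclasses import dataclass

# Illustrative subset of Spanish stopwords; the list used in the paper is larger.
STOPWORDS = {"el", "la", "los", "las", "de", "del", "en", "y", "o", "para", "que"}

CLASSES = ("LearningStyle", "IntelligenceType", "LearningStrategy")


@dataclass
class Paper:
    key: int    # K: numeric paper key
    text: str   # T: preprocessed paper text
    label: str  # C: class assigned by the domain experts


def preprocess(raw_text, min_length=2):
    # Keep runs of letters only, which discards numbers and punctuation.
    tokens = re.findall(r"[^\W\d_]+", raw_text.lower())
    # Drop stopwords and words shorter than min_length letters.
    return [t for t in tokens if len(t) >= min_length and t not in STOPWORDS]


tokens = preprocess("Los 4 estilos de aprendizaje y la teoría de Kolb (1976)")
paper = Paper(key=1, text=" ".join(tokens), label="LearningStyle")
```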

Figure 2. Methodology proposed
Appendix A lists the papers included in the corpus, and Table 2 shows the vocabulary frequency in each class; this analysis was carried out after the initial preprocessing.

Table 2. Corpus Vocabulary
The LearningStrategy class has a larger vocabulary, but this class contains two levels of subcategories, which explains the difference between the classes. The analysis showed that the classes share many words. The final corpus vocabulary contains 18,563 elements.
According to the proposed methodology, a dictionary was built using the principal concepts of each class. First, an initial list of words was looked up in dictionaries; secondly, the words of each definition and its synonyms were added to the dictionary; and finally, stopwords were deleted. The result was a dictionary with 336 words. For example:

Initial Word: Corporal (Type of intelligence)
Definition: la capacidad para utilizar el propio cuerpo para realizar actividades o resolver problemas (the ability to use one's own body to carry out activities or solve problems)
Synonyms: morfológico, físico, corpóreo, material
Terms Added to Dictionary: corporal, capacidad, utilizar, propio, cuerpo, realizar, actividades, resolver, problemas, morfológico, físico, corpóreo, material

The clustering methods explained in Section 4 were used for the experiments. These methods were selected because they accept the number of clusters as an input parameter. In all cases, only this parameter was changed, and the rest of the parameters were left at their default values.
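The dictionary-building step illustrated by the Corporal example can be sketched in a few lines. The function name and the stopword subset are assumptions made for this illustration; the word, definition and synonyms are the ones given in the example.

```python
import re

# Illustrative stopwords, limited to those that occur in the example definition.
STOPWORDS = {"la", "el", "para", "o"}


def dictionary_terms(initial_word, definition, synonyms):
    # Candidate terms: the initial word, the words of its definition, its synonyms.
    candidates = [initial_word] + re.findall(r"[^\W\d_]+", definition) + synonyms
    terms = []
    for word in candidates:
        word = word.lower()
        # Skip stopwords and avoid adding the same term twice.
        if word not in STOPWORDS and word not in terms:
            terms.append(word)
    return terms


terms = dictionary_terms(
    "Corporal",
    "la capacidad para utilizar el propio cuerpo para realizar actividades "
    "o resolver problemas",
    ["morfológico", "físico", "corpóreo", "material"],
)
```

Applied to every initial word of the three classes, this procedure yields the feature set referred to as the dictionary.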
For the experiments, the real corpus classes were represented by the numbers 0, 1 and 2 for LearningStrategy, LearningStyle and IntelligenceType, respectively. Word frequencies were extracted, as well as the term frequency-inverse document frequency (TF-IDF) metric (Manning & Schütze, 1999). In addition, two feature sets were used: the corpus vocabulary as attributes and the dictionary words as attributes. Table 3 shows the clusters created by each algorithm for each feature set.
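The difference between the two feature sets can be illustrated with a small TF-IDF computation. The sketch below assumes raw term counts and an unsmoothed idf of log(N / df); library implementations such as Scikit-learn apply smoothing and normalization by default, so their values differ. The three toy documents and the three dictionary words are hypothetical.

```python
from collections import Counter
from math import log


def tfidf_matrix(documents, features):
    # documents: list of token lists; features: ordered list of attribute words.
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    df = {f: sum(1 for c in counts if c[f] > 0) for f in features}
    idf = {f: log(n_docs / df[f]) if df[f] else 0.0 for f in features}
    return [[c[f] * idf[f] for f in features] for c in counts]


docs = [
    ["estilo", "aprendizaje", "activo"],
    ["inteligencia", "musical", "aprendizaje"],
    ["estrategia", "aprendizaje", "estrategia"],
]
vocabulary = sorted({w for d in docs for w in d})        # corpus vocabulary as attributes
dictionary = ["estilo", "inteligencia", "estrategia"]    # dictionary words as attributes

full = tfidf_matrix(docs, vocabulary)
reduced = tfidf_matrix(docs, dictionary)
```

A word shared by every document, such as "aprendizaje" here, receives a weight of zero, while the dictionary restricts the attributes to a much smaller set of class-related terms.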
Table 3. Clusters created by each algorithm and corpus

Greater balance is observed when the dictionary words are used as features. Since the initial corpus is balanced across its three classes, one would expect the clustering results to be similarly balanced. In this case, a smaller number of attributes generates more balanced groups within the corpus. Table 4 shows the results of applying the metrics described in subsection 4.1.

Table 4. Clustering metrics
The best results in all metrics were obtained using the dictionary as features and the Birch algorithm, most notably for the Fowlkes metric (96%). The worst results were obtained using the vocabulary as features and the Spectral algorithm, with some metrics below 5%. Figure 3 presents the best and the worst results obtained.
First, the actual labels (center) are represented as three balanced sets. To the left, the results of the Birch algorithm (dictionary) are shown, which are very similar to the actual labels. For example, the first group has only one instance fewer than the actual set and does not contain any extra element. This group corresponds to the LearningStrategy category. Instance 17 is grouped with the LearningStyle category, along with all the instances that actually belong to it. This can be seen in Table 3, where the row of real labels and the row corresponding to the Birch algorithm with the dictionary are very similar, except for item 17 (highlighted in bold). Finally, the IntelligenceType category (lower right group in the Venn diagram) is the best defined, since for this algorithm it does not overlap with the other two categories. With the proposed methodology, a manually labeled corpus of the pedagogical domain can be labeled automatically using the Birch clustering algorithm and the dictionary that was built.
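A row-by-row comparison between real labels and cluster assignments requires matching the arbitrary cluster numbers to the class numbers. The following sketch shows one simple way to do this, by renaming each cluster to the majority class among its members; this majority-vote mapping is an assumption made for illustration (it can map two clusters to the same class), and the six-instance labelings are toy data, not the values of Table 3.

```python
from collections import Counter


def align_clusters(labels_true, labels_pred):
    # Rename each cluster id to the most frequent true class among its members.
    mapping = {}
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return [mapping[p] for p in labels_pred]


def mismatched_keys(keys, labels_true, labels_aligned):
    # Keys of the instances whose aligned cluster differs from the real class.
    return [k for k, t, a in zip(keys, labels_true, labels_aligned) if t != a]


keys = [1, 2, 3, 4, 5, 6]
real = [0, 0, 1, 1, 2, 2]
clustered = [2, 1, 1, 1, 0, 0]  # cluster ids are arbitrary

aligned = align_clusters(real, clustered)
differences = mismatched_keys(keys, real, aligned)
```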

Conclusions and Future Work
In this paper, a proposal for the creation of a corpus for the pedagogical domain was presented, together with the use of a dictionary of main concepts enriched with synonyms. This corpus was automatically processed and analyzed using clustering techniques in order to find similarities between the clusters created by established algorithms and the real classes of the corpus.
This proposal is oriented to the Spanish language. According to the state of the art, there is much research in pedagogy and other domains, but most of it is in English and a little in other languages such as Chinese. Also, this research involves three principal classes and computational techniques for construction and class detection, processes that are usually reported using a manual approach.
As future work, the first tests for class detection in the corpus will be carried out in order to start the ontology learning process.


Figure 3. Comparison between real clusters, clusters created by the Birch algorithm (dictionary) and clusters created by the Spectral algorithm (vocabulary). The numbers represent the instance keys