
Automatic Construction of Concept Maps

Belinda Ng’asia Wafula

Master’s Thesis

Faculty of Science and Forestry, School of Computing

May 2016

UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing

Student: Belinda Ng’asia Wafula: Automatic Construction of Concept Maps
Master's Thesis, 82 p., 7 appendices (20 p.)

Supervisor of the Master's Thesis: PhD Wilhelmiina Hämäläinen
May 2016

Abstract:

A concept map is a graphical representation of the concepts and relations of some knowledge domain as understood by the user. A common issue is the difficulty of creating and evaluating different concept maps: not even a human expert can say for certain what a "correct" concept map should look like, hence the need for semi-automatic or automatic generation of concept maps. In this thesis, we give a literature review of different automatic and semi-automatic methods for constructing concept maps. Then we introduce a new automatic method for constructing concept maps. The heuristic applied to extract concepts is term occurrence, and a similar principle is applied in extracting relations. Initial results show that sensible concepts and nouns occur more frequently in a given test material, and more sensible relations between concepts also occur more frequently in the text. Combined with syntactic analysis and auxiliary ontologies, term occurrence can be seen as a viable approach to constructing fully automatic concept maps.

Keywords: Concept maps, automatic construction, algorithm, text material

CR Categories (ACM Computing Classification System, 1998 version): K.3.1 Computer Uses in Education, H.3.1 Content Analysis and Indexing, I.2.7 Natural Language Processing


Acknowledgement

The start and completion of this thesis would not have been possible without the abundant support from a number of people.

First and foremost, I would like to express my utmost gratitude to my supervisor, PhD Wilhelmiina Hämäläinen, for the endless support, guidance and encouragement offered over the course of my study.

I would also like to acknowledge Professor Pasi Fränti of the School of Computing, University of Eastern Finland, as the second examiner of this thesis, and I am gratefully indebted to him for taking the time to review this thesis.

Last but not least, I would like to express my gratitude to my parents, siblings and friends for their support and continuous encouragement throughout the years, without which this would not have been possible. Thank you.


List of Abbreviations and Symbols

ACMC  Automatic Concept Map Constructor

DM  Knowledge Discovery in DB (first three chapters)

TFCS  Theoretical Foundations of Computer Science

SW  Scientific Writing material

$s$  sentence

$w_i$  word

$M_w$  total number of extracted words

$c_i$  concept (which is represented by a word or a group of words)

$m(w_i)$  absolute frequency of word $w_i$ (the number of times $w_i$ occurs in the text)

$m_{rel}(w_i)$  relative frequency of word $w_i$

$m(c_i)$  absolute frequency of concept $c_i$ (the number of times $c_i$ occurs in the text)

$min_c$  threshold for concepts

$w_i w_j$  consecutive words corresponding to a compound concept

$m(w_i w_j)$  absolute frequency of consecutive words (compound concept), i.e., the number of times the compound concept occurs in the text

$min_{cc}$  frequency threshold for compound concepts

$m_s(c_i, c_j)$  number of times $c_i$ and $c_j$ occur in the same sentence in the text (absolute frequency)

$m_{rel}(c_i, c_j)$  relative frequency of co-occurrence of concepts $(c_i, c_j)$ in a sentence

$M_r$  total number of extracted co-occurring concepts in a sentence

$min_r$  threshold for co-occurrence of concepts in a sentence


Contents

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Organization of the thesis

2 Concept maps
2.1 An overview of concept maps
2.2 Applications and uses of concept maps

3 Semi-automatic construction of concept maps
3.1 Textstorm
3.2 Clouds
3.3 CmapTools
3.3.1 Suggesters for concepts
3.3.2 Suggesters for propositions, concept maps and multimedia resources
3.3.3 Suggesters for relevant topics
3.4 Semi-automatic construction of topic ontology
3.5 Comparisons

4 Fully automatic construction of concept maps
4.1 GNOSIS
4.2 Relex
4.3 Concept Frame Graph
4.4 Using concept maps in digital libraries as a cross-language resource discovery tool
4.5 Identifying and extracting relations in text
4.6 Leximancer
4.7 Two phase concept map construction
4.8 Related systems
4.9 Comparisons

5 ACMC: An automatic concept map constructor
5.1 Overview
5.2 Concept extraction
5.2.1 Extracting and counting the frequency of words
5.2.2 Pruning stop words
5.2.3 Plurals
5.2.4 Pruning infrequent concepts
5.3 Identifying compound concepts
5.4 Relation extraction
5.5 Development ideas

6 Tests and Experiments
6.1 Test cases
6.2 Data material
6.3 Test measures
6.4 Results
6.4.1 Overview of ACMC concept maps
6.4.2 ACMC concept maps
6.5 Discussion

7 Conclusions

Appendices
A DM book concept map produced by Leximancer
B TFCS concept map produced by Leximancer
C Scientific Writing concept map produced by Leximancer
D Stop words list
E Irregular plurals
F Hand-made concept map from TFCS
G Hand-made concept map from Scientific Writing material
H Hand-made concept map from Data mining material

References


Chapter 1

Introduction

Concept maps have been defined as graphical representations of concepts and their inter-relationships that are intended to represent the knowledge structure that humans store in their minds [NG84].

In this thesis we give a systematic overview of existing approaches for automatic and semi-automatic concept map generation. We also introduce a new fully automatic method for generating concept maps from text based on frequencies of occurrence and co-occurrence of concepts.

1.1 Motivation

For educational purposes, concept maps are used as a learning tool for students. An effective concept map can be considered one that is easily understood by a second party. Constructing an effective concept map is sometimes considered a complex task, as users may find it difficult to remember some concepts of a certain topic; hence the need for automatic construction of concept maps. By constructing concept maps, students can get an overview of a given topic. Teachers can also use concept maps as a reference tool to check whether all relevant relations are represented in their material.


1.2 Objectives

The first objective of this research is to investigate computational approaches applied in the construction of concept maps from text. We concentrate on automatic methods for constructing concept maps, although we also review some semi-automatic methods based on interaction with the user.

The second objective of the research is to develop and evaluate a new fully automatic method for constructing concept maps from text-based learning material. We design, implement and test the system. The new algorithm employs the frequency of occurrence of a term in a text to extract and select salient concepts. Potential relations are identified if extracted concepts occur in the same sentence.

Lastly, we evaluate the frequency-based approach used to construct concept maps automatically and how it works with different kinds of texts.

1.3 Organization of the thesis

The organization of this thesis is as follows: We begin in Chapter 2 by presenting the notion of concept maps, their uses, types and how they are constructed. Chapters 3 and 4 describe semi-automatic and automatic methods of constructing concept maps respectively. In Chapter 5 we introduce a new method for constructing concept maps from text. The experiments are reported in Chapter 6 and the final conclusions are drawn in Chapter 7.


Chapter 2

Concept maps

This chapter briefly describes the theoretical foundations of concept maps and reviews their uses and applications.

2.1 An overview of concept maps

Novak [NG84] defines concept maps as "representations of concepts and their inter-relationships that are intended to represent the knowledge structure that humans store in their minds". A concept map can also be described as a graphical representation of the user's knowledge in a given domain [MSS99]. A concept is described as some regularity within a group of facts and is designated by some sign or symbol.

Usually, concepts are represented by words or word groups (especially nouns and noun phrases). Novak gives an example of a "chair", which is a label/sign for an instrument with four legs, a surface to sit on and a back to rest against.

Concept maps are composed of concepts, which are enclosed in circles or boxes (nodes), and relations between concepts, indicated by a connecting line linking two concepts. Words or groups of words (typically verbs) on the connecting line depict a labelled relation and are known as linking phrases [SKUP+04]. Relations between concepts can be represented as an unlabelled line between two nodes (Figure 2.1), a labelled line describing the relationship (Figure 2.2), an arrow showing the direction of the relationship between the concepts (Figure 2.2), or a line with a special symbol at the end showing the type of relationship (Figure 2.3).

Figure 2.1: An unlabelled and non-directional relationship between two concepts.

Figure 2.2: Labelled and directional relationships between concepts.

The process of constructing concept maps begins with identifying a familiar domain.

The topic of a concept map can be a text or a particular problem or question to focus on. The next step involves identifying key concepts, from the most general and inclusive to the less inclusive concepts that apply to the domain in focus. The last step is to identify relations between the concepts and to find the appropriate words to describe those relations, so that each connection forms a meaningful proposition [NC06].

In its simplest form, a concept map has two nodes and a connecting line between them. Types of concept maps range between two extremes: hierarchical or tree-structured maps and mind maps. Novak [NC06] deems hierarchical concept maps to be ideal. They are constructed in a tree-like manner, with the more general concepts at the top of the map and the more specific concepts hierarchically towards the bottom. Mind maps, on the other hand, are constructed freely from a key idea, allowing any kind of association. Figure 2.4 shows an example of a hierarchical concept map and Figure 2.5 shows an example of a mind map.

Figure 2.3: Types of relationships between concepts.

In some cases, concept maps consist of extensions that clarify and complement the concepts. Such extensions include resources such as Web pages, pictures, examples and text in the concept map [nHC+04].

2.2 Applications and uses of concept maps

Concept maps have been widely used in education. They have been demonstrated to be a successful instructional tool to help learners in their understanding process.

Concept maps are popular as they aid in creative thinking, knowledge extraction, planning, note taking, summarization [SRF03], idea generation, knowledge creation [AKM+03], and as assessment [HBN96] and evaluation tools [MMJ94]. Concept maps can also be used to summarize papers: according to [RF05], a concept map can be as good a summary as an abstract, and is easier to prepare and translate automatically than a written abstract.

David et al. [DSB] have used concept maps and concept questions at university level in engineering to help students' conceptual understanding of the discipline and to stimulate thinking. In [WSL06], concept maps have been used in searching through historical archives; these maps provide a representation of the important retrieved entities, which might be used in later searches. Maria [Jak03] demonstrated the application of concept maps, in conjunction with practical and cognitive apprenticeships, to teach and improve programming skills in holistic learners. The use of concept maps proved to stimulate meaningful learning in undergraduate medical students taking a PBL (problem-based learning) course [RFP06]. McClure et al. [MSS99] researched the use of concept maps to assess learners' knowledge of certain concepts.

Figure 2.4: Sample concept map of a concept map [NC06].

Figure 2.5: An example of a mind map representing the author's understanding of an Educational Technology course.

The use of concept maps is not restricted to education; they are also used in business planning, public administration and the health sector, among others. Concept maps have been employed in community mental health [JBS00] for program planning and evaluation purposes. Compared to other knowledge elicitation tools, concept mapping is considered an efficient method for generating models of domain knowledge [HCCN02]. When integrated with other systems, concept maps have been used as interfaces for intelligent software (i.e., knowledge-based systems and tutoring systems) in various domains [CCH+03].

From an educational instructor's point of view, concept maps can be used to reveal a learner's understanding or misconceptions [RRS98] of a certain knowledge domain.

There are no "correct" concept maps, but often the teacher's concept map is used as a reference map [dRdCJF04]. However, a teacher's map reflects the teacher's way of thinking; for a more objective map, a different construction approach is needed. Automatically constructed concept maps are less biased, easy to generate and can be used as reference maps. Hideo et al. [FYI02] developed a concept mapping software that "supports the externalization of ideas, reflection on thinking processes and dialogues" and enables collaborative learning by permitting several users to construct one concept map together. Several tools, such as CmapTools [LMR+03], Clouds [POC00], Leximancer [SH05] and GNOSIS [GS94], attempt to construct concept maps automatically or in interaction with the user.

In summary, a concept map is a type of knowledge representation used to develop mental schemas or mind maps that act as a reference for future actions and thinking [BB00]. Concept maps can be applied in many different areas and are not limited to the field of education.


A common issue is the difficulty of evaluating different concept maps: not even a human expert can say for certain what a "correct" concept map should look like. Therefore, it can be hypothesized that an automatically constructed concept map has a reduced degree of bias compared to a manually constructed one.


Chapter 3

Semi-automatic construction of concept maps

Semi-automatic construction of concept maps is an approach where a software tool is used to create concept maps with the help of the user.

In this chapter, we review four tools dedicated to assisting in the process of constructing concept maps. These tools suggest elements (concepts, topics or relations) based on a given domain. As they are used for semi-automatic construction of concept maps, the role of the user in the construction process is also discussed. We present the algorithms the tools use for extracting and suggesting elements, and give a short summary and comparison of them.

The four tools, introduced in the following sections, are Clouds [POC00], Textstorm [APC01], CmapTools [LMR+03], and a tool for semi-automatically constructing topic ontologies [FMG05].

3.1 Textstorm

With no prior knowledge about the domain in focus, Textstorm [APC01] parses and tags a text file, producing binary predicates (e.g., "eat(cow, plants)"). The system feeds its output into another system, Clouds [PC00].


Textstorm tags a text file using WordNet [MBF+90]. The predicates built map relations between two concepts obtained by parsing sentences. Since a concept in a text is not always referred to by the same name, Textstorm uses the synonymy relation from WordNet to find concepts previously referred to with a different name. In Textstorm, relations are identified as verbs in a sentence, with the subject as the first concept and the object (verbal phrase) as the second concept in the predicate (e.g., from the sentence "Jupiter is a big planet", Textstorm builds the predicate "isa(Jupiter, big)").

The resulting predicates act as inputs to Clouds [PC00], a system that, through interaction with the user, builds a complete concept map.

3.2 Clouds

Clouds [POC00] is a program that suggests concepts and relations to the user. The user first "feeds" the program with the basic concepts of the domain and provides concepts to be focused on, based on the questions posed by Clouds.

Three algorithms are used: one selects which concepts of the concept map to work with, and two further algorithms, based on inductive learning, suggest concepts and relations for the map.

First, the domain knowledge is given as an ontology of primitive concepts (an isa-tree). The first three levels of the tree are fixed, but the user can add new concepts to the tree. Figure 3.1 shows a graphical representation of the first three levels provided to Clouds as the ontology base.

The tasks performed by Clouds are as follows:

1. Clouds starts by selecting the most relevant but not fully explained concepts of the map to work with. The relevance of a concept is defined as follows:

Definition 1

$Rel(c_1, c_2) =$ the number of separate paths between $c_1$ and $c_2$

$AbsRel(c_1) = \sum_{c_i} Rel(c_1, c_i)$


Figure 3.1: First three levels of the ontology base used in Clouds.

2. The program aims to find, for each relation, the components that are related by it. With the help of the isa-tree, the program analyzes the existing relations, establishing which categories are typically linked by them. For instance,

Example 1 If it is observed that:

produce(apple tree, apple)
produce(pear tree, pear)

and from the domain knowledge it is known that

isa(pear tree, tree)
isa(apple tree, tree)
isa(pear, fruit)
isa(apple, fruit)

a generalization is deduced:

produce(tree, fruit)

Generalization is defined as obtaining categories up the tree, while specialization is obtaining categories down the tree. Clouds searches for the most general specializations that "avoid" this new observation. The results of the search are split into binary predicates that represent the pairs of argument categories that cover the positive examples. The final step involves explaining relations in a given context. The algorithm used to learn the relations is based on inductive logic programming [MF90]. In the map, the context is defined as the relations each argument has with other concepts, up to a predefined depth. In this phase, generalization may happen in two ways, universal quantification or the dropping of a term, illustrated in Figures 3.2 and 3.3 [POC00]. The dropping of a term occurs when a new observation reflects an over-specialization of a clause.

Figure 3.2: Joining the contexts leads to a generalization.

Figure 3.3: Universal quantification of the first argument of eat.

Specialization occurs when a negative example is given. Clouds selects clauses that "cover" the example and finds all declared positive predicates covered by the selected clauses. A new hypothesis is generated by adding to the previous clause terms that are satisfied by the positive example.

3.3 CmapTools

CmapTools [LMR+03] is a tool that allows the user and the program to construct concept maps interactively. CmapTools allows the integration of other multimedia resources into the concept maps. Starting from an incomplete concept map, CmapTools suggests concepts, propositions, other concept maps and new topics to the user.

3.3.1 Suggesters for concepts

In CmapTools, the concept suggester is a module used for searching for and suggesting new concepts [CCA+02] to be used in the concept map.

Based on the current map, the system mines the Web for relevant documents, which are cached to be used for mining concepts at a later stage. The current map is converted into a text query, which is used to retrieve additional relevant documents. From the collected documents, a search is made for the documents related to the current concept map.

To find concepts to suggest, the system searches the retrieved documents for the concepts already in the map. For each concept found, the neighboring words, defined by a distance threshold (currently 3 words), are identified as potential concept suggestions. The frequency of these terms is used to determine the concepts to be suggested.
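This neighborhood heuristic is easy to sketch. The following is an illustrative Python sketch, not CmapTools code; the function name, the whitespace tokenization and the ranking by raw frequency are our own assumptions:

    from collections import Counter

    def suggest_concepts(documents, map_concepts, window=3, top_n=10):
        # Suggest new concepts: words occurring within `window` words of a
        # concept already on the map, ranked by how often they are seen there.
        known = {c.lower() for c in map_concepts}
        candidates = Counter()
        for doc in documents:
            words = doc.lower().split()
            for i, word in enumerate(words):
                if word in known:
                    neighborhood = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                    candidates.update(w for w in neighborhood if w not in known)
        return [w for w, _ in candidates.most_common(top_n)]

For instance, suggest_concepts(cached_documents, ["concept", "map"]) would return the words that most often appear near "concept" or "map" in the cached documents.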

3.3.2 Suggesters for propositions, concept maps and multimedia resources

This part of the system applies case-based reasoning [Kol93, Lea96] to provide proposition and concept map suggestions by analyzing prior knowledge models.

(22)

When a user wants to "extend" a concept map, the system views the original map and prior concept maps as examples of how that concept was extended in the past.

A category index, computed from the concept map library, is used to organize concept maps into a hierarchical structure. The index of each category maintains references to the original maps and a cluster representative. The cluster representative is used to determine whether a new concept map is related to the maps in the category.

Concept map similarity is computed from a vector representation of the concept maps. The system assigns higher weights to keywords from the top of a concept map and lower weights to keywords from the bottom. The weight of keyword $i$ of concept $k$ in map $C_j$ is computed as:

$w_{ijk} = freq_{ijk} \cdot (\alpha n + \beta m) \cdot \left(\frac{1}{d+1}\right)^{1/\delta}$

where $C_j$ is a concept map in the library of maps $L$ and $freq_{ijk}$ is the raw frequency of keyword $i$ in the label of concept $k$.

The total weight of keyword $i$ in $C_j$ is the sum of the weights $w_{ijk}$ over all concepts $k$ in map $C_j$.

Users can initiate a new search for concepts or multimedia resources by selecting the concepts for which extensions are sought. The suggester converts the map into a vector representation and extracts the keywords selected by the user or the suggester. The keywords are used to search for suggestions in a case, while the vector is used to perform a binary search for the best-fitting category.

The extracted suggestions are ranked by means of a keyword correlation metric, which is based on the distance between concepts within a concept map. The distance-based correlation between keywords $i$ and $j$ is computed as:

$M_\chi(i, j) = \frac{2}{|\Theta_i| + |\Theta_j|} \times \sum_{C \in (\Theta_i \cap \Theta_j)} \frac{1}{D_C(i, j)}$

where $\Theta_i$ and $\Theta_j$ are the sets of maps in $\chi$ containing keywords $i$ and $j$, and $D_C$ is computed as 1 + the minimum number of links between the concepts containing $i$ and $j$. For a keyword pair $(i, j)$, the rank is computed by taking $i$ from the potential suggestions and $j$ from the selected concepts in the map being constructed.
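A direct implementation of the reconstructed metric might look as follows; the function and argument names are illustrative assumptions, and link_distance is assumed to already hold the values $D_C(i, j)$ per shared map:

    def keyword_correlation(maps_with_i, maps_with_j, link_distance):
        # Distance-based correlation M(i, j) between two keywords.
        # maps_with_i, maps_with_j: sets of map identifiers containing each keyword.
        # link_distance: dict mapping a shared map C to D_C(i, j), i.e.,
        # 1 + the minimum number of links between the concepts containing i and j.
        if not maps_with_i or not maps_with_j:
            return 0.0
        shared = maps_with_i & maps_with_j
        norm = 2.0 / (len(maps_with_i) + len(maps_with_j))
        return norm * sum(1.0 / link_distance[c] for c in shared)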


3.3.3 Suggesters for relevant topics

EXTENDER (EXtensive Topic Extender from New Data Exploring Relationships) is a module that suggests novel topics, presented as small collections of terms, to be included in the knowledge model. The approach mines the Web using information automatically gained from the current concept map. The system takes the knowledge model as input and mines the Web for topics related to the current model. New information is used to guide further searches, with each generated topic specified by a set of weighted terms. EXTENDER goes through the following steps to generate topics:

1. Apply topological analysis to convert concept maps to a vector form and generate the initial corpus.

2. Combine weighted terms to produce the first generation of artificial topics.

3. Repeat steps 4-10 until the final generation of topics.

4. Define a similarity threshold using the diversity factor.

5. Define the context for the search.

6. Generate queries for a web search engine.

7. Filter irrelevant results using the context and the similarity threshold.

8. Identify relevant novel keywords and update the corpus.

9. Use the diversity factor to integrate the returned results with prior information and complete the term-web page matrix.

10. Apply term clustering to the term-web page matrix to obtain a new generation of artificial topics.


3.4 Semi-automatic construction of topic ontology

Ontologies can be interpreted as a special case of concept maps; a topic ontology is one example. A topic ontology is defined as a set of topics connected with different types of relations [FMG05]. The method proposed in [FMG05] applies Latent Semantic Indexing (LSI) [DDF+00] and k-means clustering [JMF99] to discover and suggest topics within a corpus.

The text documents are first converted into a vector representation using the standard Bag-of-Words (BOW) model with TFIDF weighting [Sal91]. The similarity between two documents (cosine similarity) is computed as the cosine of the angle between their vector representations.

Topics are extracted from the documents using LSI with Singular Value Decomposition (SVD) on the BOW representation, which detects words with similar meanings. The k-means clustering algorithm is used to cluster documents that share similar words.
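The BOW/TFIDF representation, cosine similarity, LSI and k-means steps described above can be reproduced with scikit-learn. The following is a minimal sketch under our own choice of library and toy documents, not the implementation of [FMG05]:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "concept maps represent concepts and relations",
        "a topic ontology connects topics with relations",
        "topics are discovered by clustering similar documents",
    ]

    # Bag-of-Words vectors with TF-IDF weighting
    X = TfidfVectorizer().fit_transform(docs)

    # Cosine similarity between the first two documents
    print(cosine_similarity(X[0], X[1]))

    # LSI: project onto latent topics via truncated SVD, then cluster with k-means
    lsi_vectors = TruncatedSVD(n_components=2).fit_transform(X)
    print(KMeans(n_clusters=2, n_init=10).fit_predict(lsi_vectors))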

Extracting keywords from the documents involves two methods. The first is keyword extraction using the centroid vector of a topic; in this context, the centroid is the sum of all vectors of the documents inside the topic, and keywords are selected based on the weights of the centroid vector. The second method involves a Support Vector Machine (SVM) binary classifier [Joa99]. The authors use the following example to illustrate it. Suppose A is a topic to be described with keywords. All documents with A as a subtopic are marked as negative, and documents from topic A are marked as positive; if a document has both positive and negative marks, it is marked as positive. An SVM classifier is used to classify the centroid of topic A, and the keywords are the words whose weights in the SVM normal vector contribute most when deciding whether the centroid is positive.

In recent years, other systems related to semi-automatic construction of concept maps, by building and learning ontologies, have been developed. SOAT [WH02] uses part-of-speech structures and rules for the Chinese language to extract concepts and relations. Onto-Learn [NVG03], ASIUM [SZL14] and Adaptiva [BCW02] use similar methods: they apply linguistic patterns and machine learning in the extraction process. Text-To-Onto [MS01, MS04] employs different extraction approaches and combines the results to support the construction of ontologies.

3.5 Comparisons

Table 3.1: Comparison of the discussed semi-automatic concept map construction tools.


Chapter 4

Fully automatic construction of concept maps

Fully automatic construction of concept maps means that the system constructs a concept map from a source, for instance a text document, without the user being involved in the process. In this chapter, we present an overview of the different approaches to automatic concept map construction and discuss existing works and applications that have contributed to progress in the field.

We then compare the different applications discussed: the different forms of resources and initial inputs used in the construction of concept maps, the different methods and approaches utilized in the process, and the final products produced by each application.

4.1 GNOSIS

Gaines and Shaw [GS94] developed a system they called GNOSIS. This system automatically produced concept maps purely based on the occurrence of words in a sentence, a technique commonly used in information retrieval systems [CLR86]. The system is able to extract related concepts but does not label the relations found.

Unfortunately, not much has been documented on the algorithms used to extract the concepts and their unlabeled relations in GNOSIS.

4.2 Relex

A somewhat more sophisticated approach to generating concept maps automatically was used by Richardson and Goertzel [RGFP06]. The Relex tool uses grammatical analysis to extract noun phrases and noun-verb-noun relations. Relex uses template-matching algorithms to convert syntactic dependencies to graphs of semantic primitives [Wie96]. Relex converts passive and active forms into the same representation and assigns tenses and numbers to sentence parts. The system uses CMU's link parser [ST93] and WordNet for morphological functions [Fel98].

4.3 Concept Frame Graph

In [RT02], a collection of documents is represented as a special kind of concept map, called a concept frame graph. The nodes of the graph are described as concept frames.

Definition 2

A concept frame is an object represented as [NAME, SYNSET, RELS, CONTEXTS],

where NAME is the name of the concept, SYNSET is a set of synonyms of the concept, and RELS is a set that describes the relations of the concept with other concepts. Each relation is represented as a tuple (AgentCF, RELS, ObjectCF), where AgentCF and ObjectCF are pointers to concept frames and RELS is the relation between them. CONTEXTS is an optional set of text segments corresponding to each relation tuple in RELS.
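A concept frame maps naturally onto a small record type. In the following Python sketch, AgentCF and ObjectCF are represented by concept names rather than pointers, which is our simplification of Definition 2:

    from dataclasses import dataclass, field

    @dataclass
    class ConceptFrame:
        # One node of a concept frame graph: [NAME, SYNSET, RELS, CONTEXTS]
        name: str                                     # NAME
        synset: set = field(default_factory=set)      # SYNSET: synonyms
        rels: list = field(default_factory=list)      # RELS: (agent, relation, object)
        contexts: list = field(default_factory=list)  # CONTEXTS: text per relation

    planet = ConceptFrame("planet", synset={"planet", "world"})
    planet.rels.append(("planet", "orbits", "star"))
    planet.contexts.append("A planet orbits a star.")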

The concept frame graph is constructed in the following steps:

• Pre-processing: menu bars or formatting specifications are removed from documents.


Figure 4.1: Graphical representation of a concept frame.

• Named entity recognition: all entities are identified using a co-occurrence resolution algorithm [ZS01] and are then extracted from the documents.

• Grammatical analysis: parts of speech are tagged using a set of rules the authors invented themselves, resulting in an NVN 3-tuple (NC, VC, NC), where NC is a noun clause and VC is a verb clause.

• Word sense disambiguation: this step involves sense disambiguation of the extracted noun clauses. A handcrafted algorithm, using WordNet [Fel98], picks the correct word sense based on the context of the noun clause.

• Clustering: a fuzzy ART [CGR91] based clustering algorithm is applied to cluster the disambiguated noun clauses. For this purpose, the noun clauses are first converted to vectors. All key terms are extracted from the parts-of-speech information to form a weight vector $c = (c_1, c_2, \ldots, c_m)$, where $m$ denotes the number of features extracted and $c_i$ the term frequency for term $i$, $i = 1, \ldots, m$. The vector is then normalized by dividing all elements by $\max_i c_i$.

• Frame filling: a collection of cluster members forms a SYNSET. RELS are formed by generalizing the NVN 3-tuples. Sentence fragments corresponding to each relation tuple are collected to form the CONTEXTS. The name of the frame is established as the most dominant member of the SYNSET.

4.4 Using concept maps in digital libraries as a cross-language resource discovery tool

Richardson and Fox [RF05] used an approach somewhat similar to the grammatical analysis in [RT02]. The notion of part-of-speech is applied to find noun phrases in electronic theses and dissertations (ETDs). These are extracted using the MontyTagger [Liu03] program and used as nodes in the concept map. Verbs or prepositions are used as the links between the nodes, with the linking word selected based on its frequency of occurrence in the document against its frequency in the language. They had two choices of how to select the concept maps. In the first, only the most important concepts, selected based on the overall document, were included in the concept map; they also produced maps based on each chapter of the document, with the nodes identified based on their part-of-speech. The second method used chapter and section headings as a skeleton concept map, then selected terms based on how frequently they appear together with words in the chapter and section headings. Several relation extraction methods were tested: Pearson's chi-squared, Dice's coefficient and mutual information were found not to be ideal, as they favored uncommon terms in the text, while association rules [CGR91] and t-scores produced relevant relations.
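For reference, two of the measures named above can be stated directly from co-occurrence counts. The sketch below uses our own notation (f_a and f_b for the term counts, f_ab for the pair count, n for the number of co-occurrence contexts) and illustrates why Dice's coefficient favors rare terms while the t-score does not:

    import math

    def dice(f_ab, f_a, f_b):
        # Dice's coefficient: the fraction of the two terms' occurrences
        # that are joint.
        return 2 * f_ab / (f_a + f_b)

    def t_score(f_ab, f_a, f_b, n):
        # t-score: observed co-occurrences versus those expected under
        # independence of the two terms.
        if f_ab == 0:
            return 0.0
        return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)

    # A rare pair that always co-occurs gets a perfect Dice score but only
    # a modest t-score:
    print(dice(2, 2, 2), t_score(2, 2, 2, 10_000))            # 1.0 vs. ~1.41
    print(dice(50, 200, 300), t_score(50, 200, 300, 10_000))  # 0.2 vs. ~6.22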

4.5 Identifying and extracting relations in text

Other similar research has been done by Roy and Yael [BR]. A collection of extractors, referred to as Textract, identifies terms such as people's names, places, organizations, abbreviations and other special single words in document collections. One particular extractor, the name extractor [RW33], identifies capitalized words and selected prepositions as potential names. These names are categorized based on their types, for example, as a person or a place.

The frequency of occurrence of the concepts identified by the extractors serves to identify the most significant concepts.

4.6 Leximancer

A more sophisticated system for constructing concept maps automatically has been developed [SH05]. Leximancer [Smi05] is a data mining tool that extracts information from text documents and represents the information as main concepts and their relations. Concepts in Leximancer are defined as "collections of terms that provide evidence for the use of the concept in the text". In addition, Leximancer identifies proper names (words that start with capital letters) as potential candidate concepts. Leximancer extracts the main concepts and measures their frequencies.

The concept extraction phase begins with Leximancer identifying "seed" words, which form the starting points of concepts. These are the most frequently appearing words that are not stop words. Concepts are established based on the seed word and words associated with it, by identifying words that occur close to the seed words. This process is known as concept learning and involves the following steps:

1. The relevancies of a seed word and all other words in the document are calculated.

2. If the relevancies fall above a set threshold, the words are added to the concept definition list.

3. The relevancies between the other words in the document and the new concept definition list are calculated.

4. If the relevancies fall above the threshold, the words are added to the concept definition list again.

5. The learning stops when the number of sentence blocks classified by each concept remains stable.

Leximancer determines relationships by measuring the closeness and frequency of the extracted concepts in the text. A window (a specified length of words or sentences) is moved sequentially through the text, and the concepts within this window are marked. The frequency of each pair of co-occurring concepts against all others is calculated, resulting in a concept co-occurrence matrix.

Leximancer uses Bayesian decision theory and word association norms to compute the relevancy measures.
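The sliding-window counting can be sketched as follows. This only illustrates the windowing idea, with a window of two sentences and exact word matching; Leximancer's actual classification and weighting are more involved:

    from collections import Counter

    def cooccurrence_matrix(sentences, concepts, window=2):
        # Count how often two concepts fall inside the same sliding window
        # of `window` consecutive sentences.
        counts = Counter()
        for start in range(max(1, len(sentences) - window + 1)):
            block = set(" ".join(sentences[start:start + window]).lower().split())
            present = sorted(c for c in concepts if c in block)
            for i, a in enumerate(present):
                for b in present[i + 1:]:
                    counts[(a, b)] += 1
        return counts

    sents = ["Concept maps aid learning", "Learning requires concepts",
             "Maps show relations between concepts"]
    print(cooccurrence_matrix(sents, ["learning", "maps", "concepts", "relations"]))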

4.7 Two phase concept map construction

In other instances, the work is focused on extracting relations. Sue et al. [SWST04] propose a different approach for constructing concept maps for a course from historical test records: Two phase concept map construction (TP-CMC). TP-CMC involves two phases, Grade fuzzy association rule mining and Concept map construction.

TP-CMC uses a table, the Test item concept mapping table, that records the related concepts of each test item in a quiz. The algorithm does not aim to extract concepts, as the concepts have already been established; rather, it identifies pre-requisite relationships among the concepts in the test items and constructs a concept map based on these relations.

Grade fuzzy association rule mining phase. This phase involves the following steps:

1. Grade fuzzification: this process applies fuzzy set theory to convert numeric grade data into the symbolic notation "Low", "Mid" and "High", representing low, middle and high grades respectively.

2. Anomaly diagnosis: the discrimination of an item is used to set good test items apart from bad test items. This step aims to refine the input data by removing redundant data that will not be used in the concept map. If the discrimination of an item is too low (most students get high scores or low scores), the item is considered redundant. To remove the redundancy, fuzzy item analysis for norm-referencing (FIA-NR) is applied to the input data.

3. Fuzzy data mining: in this step, the algorithm recognizes relationships between two test items. The look-ahead fuzzy association rule mining algorithm [TTL01] is used to find fuzzy associations between the test items.

Concept map construction phase. In this phase, further refined association rules, based on observations of real learning situations, are used to analyze the pre-requisite relationships between learning concepts in quizzes. A proposed algorithm, the Concept map construction algorithm, is used to find the corresponding concepts of concept sets to construct concept maps. The algorithm is based on the Test item concept mapping table and the pre-requisite relationships. Finally, a Cycle detection process is used to detect and delete unwanted pre-requisite relationships that form a cycle between concepts.

4.8 Related systems

[Coo] concentrates on finding relations in a document collection in the biomedical domain. Relations between terms are computed based on proximity and frequency: if two terms often occur near each other, there is a stronger relation between them than if they occurred close together only once.

Furthermore, the weights of the relations are computed using the following formula:

$m = \log\left[\frac{totalterms \times paircount}{freq_1 \times freq_2}\right]$

where $totalterms$ is the total number of unique terms in the collection, $paircount$ is the number of documents in which both terms occur, and $freq_1$ and $freq_2$ are the frequencies of the two terms. The values of $m$ (mutual information values for the term pair) lie between 0 and 100. Figure 4.2, cited from [Coo], shows the output produced.

Figure 4.2: Screen showing the relations found to a concept.
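Restated in code, the weight is a one-liner. The example counts below are made up, and since the printed formula does not specify the logarithm base, the natural logarithm here is an assumption:

    import math

    def relation_weight(totalterms, paircount, freq1, freq2):
        # Mutual-information-style weight of a term pair (reconstructed formula).
        return math.log((totalterms * paircount) / (freq1 * freq2))

    # e.g., 10,000 unique terms, a pair co-occurring in 12 documents,
    # individual term frequencies 40 and 25:
    print(relation_weight(10_000, 12, 40, 25))  # log(120), roughly 4.79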

4.9 Comparisons

Table 4.1 summarizes the approaches for automatically constructing concept maps that were discussed above.

Table 4.1: Summary of the discussed approaches for automatically constructing concept maps.


Chapter 5

ACMC: An automatic concept map constructor

In this chapter, we introduce ACMC, an automatic concept map constructor. ACMC enables learners, instructors and evaluators to construct concept maps automatically from text (learning material).

5.1 Overview

The main processes involved in ACMC are the extraction of words from the text, the finding of significant concepts, and the extraction of potential relations between the extracted concepts. The basic method used in ACMC to extract concepts is term occurrence; no syntactic analysis, auxiliary ontologies or other resources are needed. Potential relations between concepts are extracted based on whether the concepts occur in the same sentence.

As input, ACMC requires a number of external data sources for the construction of the concept map. ACMC requires a file containing the text material from which to extract concepts and relations, and two additional files. The first contains a list of stop words: words that are so common that they are useless to index. ACMC uses this file to access the stop words that will be eliminated from the list of extracted words. The second file contains a list of identified irregular plurals and their corresponding singulars; examples of irregular plurals are indices, theses and automata, among others. See Appendix E.

ACMC produces a concept map in text form: a list of concepts, where single concepts are one word and compound concepts are two words. The extracted relations can be between two single-word concepts, between two compound concepts, or between a single-word concept and a compound concept. Each relation is represented with each participating concept separated by curly brackets.

The main principle applied by ACMC in the extraction of concepts is based on calculating the frequencies of word occurrences or co-occurrences of two words. Note that the occurrence of a word refers to how many times the word occurs in the text, in either its plural or singular form. Co-occurrence of two words refers to how often two consecutive words occur in the same sentence. A similar approach is used to establish potential relations.

The following steps describe the method used in the extraction of concepts and relations (a minimal Python sketch of steps 1-5 is given after the list):

1. Extract words from a given text.

2. Calculate the frequencies of the extracted words.

3. Remove stop words.

4. Merge plurals and singular forms of words.

5. Prune infrequent words.

6. Construct compound words.

7. Calculate the frequencies of compound words.

8. Find relations between concepts in a sentence.

9. Prune infrequent relations.

10. Display the map.
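A minimal Python sketch of steps 1-5, under simplifying assumptions (letters-only tokenization and a plain dictionary in place of ACMC's red-black tree), is given below; steps 6-9 correspond to Sections 5.3 and 5.4:

    from collections import Counter
    import re

    def extract_concepts(text, stop_words, irregular_plurals, min_c=3):
        # Steps 1-2: extract words and count their frequencies
        freq = Counter(re.findall(r"[a-z]+", text.lower()))
        # Step 3: remove stop words
        for sw in stop_words:
            freq.pop(sw, None)
        # Step 4: merge plural and singular forms (irregular list first,
        # then the trailing-'s' heuristic of Section 5.2.3)
        for plural, singular in irregular_plurals.items():
            if plural in freq:
                freq[singular] += freq.pop(plural)
        for w in [w for w in freq if w.endswith("s") and w[:-1] in freq]:
            freq[w[:-1]] += freq.pop(w)
        # Step 5: prune infrequent words
        return {w: f for w, f in freq.items() if f >= min_c}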


5.2 Concept extraction

ACMC extracts concepts from text by first reading a file containing the text mate- rial. ACMC extracts words and counts the frequency of the words. All extracted words are put into a binary search tree.

5.2.1 Extracting and counting the frequency of words

ACMC reads a given text file and extracts the words. The frequency of a word is calculated by associating a counter with each new word encountered and increasing the counter every time the word is encountered again.

A balanced binary tree (a red-black tree) is used to store the extracted words for efficient indexing [CSRL01].

5.2.2 Pruning stop words

Stop words are words that are so common that they are useless to index. Stop words are part of English grammar and are used to form grammatically correct sentences, and scientific texts are no exception. In English, the most common stop words are "a", "of", "the" and "you", among others. See Appendix D.

ACMC uses a list of the 100 most common stop words in English [Tex]. In addition, we have added to the list a number of words which occur often but are meaningless for concept maps, such as "too", "section" and single letters of the alphabet. On the other hand, some stop words, such as "time", were excluded from the list as they were considered important concepts in the domain.

5.2.3 Plurals

In a text, a concept may occur in both its singular and plural forms. For this reason, we combine the singular and plural forms to represent one concept and to get the correct frequency of the concept.


The most common plural forms are identified with the following heuristic: if a word ends with 's', and the same word without the 's' occurs in the text, then the two are considered the same concept. A slight problem is that words are not identified as nouns, and verbs can also end with 's'.

If a word is an irregular plural form, its corresponding singular form is obtained from a list. See Appendix E.

5.2.4 Pruning infrequent concepts

Assumption: the higher the frequency of a concept, the more significant it is.

In this case, we decided to set the threshold for the absolute frequency of occurrence of a concept to ≥ 3. The user can decide the threshold, but should note the following:

• The size of the text document (a larger document would require a higher threshold).

• The writing style used in the text. For instance, text with contents in bullet and list form would require a different threshold from text written in a book format.

• The contents of the text. For instance, text that consists mostly of formulas and equations would require a lower threshold compared to plain prose.

We define the relative frequency of a concept as its absolute frequency divided by the total number of extracted words:

$m_{rel}(w_i) = \frac{m(w_i)}{M_w}.$

5.3 Identifying compound concepts

The Merriam-Webster dictionary defines a compound as a word consisting of components that are themselves words, representing a generic idea. In the context of this chapter, we define a compound concept as two words $w_i w_j$ that occur consecutively in the same sentence, for example "association rule".

The pseudocode for extracting compound concepts from the given text and pruning infrequent compound concepts is given in Algorithm 1.

Algorithm 1 FindCompoundConcepts(DataFile, tree, min_c, min_cc)

    For all sentences s = w_1 w_2 ... w_n in DataFile
        For i = 1 to n-1
            w1 = w_i
            IF ((tree.Find(w1) != NULL) AND (i < n-1))
                w2 = w_{i+1}
                IF (tree.Find(w2) != NULL)
                    CompArray[w1.index][w2.index]++

    size = CompArray.Length
    For i = 1 to size-1
        For j = i+1 to size
            Freq = CompArray[i][j]
            IF (Freq >= min_c)
                Output compound concept w_i w_j and Freq
                IF (Freq / m(w_i) >= min_cc)
                    Remove w_i from the tree   // w_i occurs seldom alone
                IF (Freq / m(w_j) >= min_cc)
                    Remove w_j from the tree   // w_j occurs seldom alone

In principle, compound concepts with more than two consecutive words could be identified in a similar way, if the heuristic is extended to accommodate them. An open question is what kinds of compound concepts cannot be identified with this heuristic.

Once the compound concepts were established, a set threshold, $min_{cc} = 0.90$, was used in the pruning step of Algorithm 1: a component word is removed as a standalone concept if the fraction of its occurrences inside the compound reaches the threshold. The user can decide the threshold to be used, but should note that it should be large enough (close to 1).
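The logic of Algorithm 1 translates to Python roughly as follows. This sketch assumes the sentences are given as lists of already extracted words and that freq is the surviving word-frequency dictionary from Section 5.2:

    from collections import Counter

    def find_compound_concepts(sentences, freq, min_c=3, min_cc=0.90):
        # Count consecutive pairs of surviving words
        pair_freq = Counter()
        for words in sentences:
            for a, b in zip(words, words[1:]):
                if a in freq and b in freq:
                    pair_freq[(a, b)] += 1
        # Keep frequent pairs as compound concepts and drop component
        # words that seldom occur outside the compound
        compounds = {}
        for (a, b), f in pair_freq.items():
            if f >= min_c:
                compounds[(a, b)] = f
                for w in (a, b):
                    if w in freq and f / freq[w] >= min_cc:
                        del freq[w]  # w occurs seldom alone
        return compounds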


5.4 Relation extraction

In extracting potential relations, we assume that two concepts are related if they appear in the same sentence. It is worth noting that two consecutive concepts are not considered related if they form a compound concept. The pseudocode for extracting potential relations between concepts is given in Algorithm 2.

Algorithm 2 FindRelations(DataFile, tree, min_r)

    For all sentences s = w_1 w_2 ... w_n in DataFile
        For i = 1 to n-1
            w1 = w_i
            IF ((tree.Find(w1) != NULL) AND (i < n-1))
                w3 = w_i + w_{i+1}   // compound concept
            IF (tree.Find(w3) != NULL)
                j = i + 2
            ELSE IF (tree.Find(w1) != NULL)
                j = i + 1
            IF ((tree.Find(w1) != NULL) OR (tree.Find(w3) != NULL))
                w2 = w_j
                IF ((tree.Find(w2) != NULL) AND (j < n))
                    w4 = w_j + w_{j+1}   // compound concept
                IF (tree.Find(w2) != NULL)
                    IF (tree.Find(w1) != NULL)
                        Relations[w1.index][w2.index]++
                    IF (tree.Find(w3) != NULL)
                        Relations[w3.index][w2.index]++
                IF (tree.Find(w4) != NULL)
                    IF (tree.Find(w1) != NULL)
                        Relations[w1.index][w4.index]++
                    IF (tree.Find(w3) != NULL)
                        Relations[w3.index][w4.index]++

    size = Relations.Length
    For i = 1 to size-1
        For j = i+1 to size
            Freq = Relations[i][j] + Relations[j][i]
            IF (Freq >= min_r)
                Output relation w_i, w_j and Freq


In the heuristic used to prune the infrequent relations, we assume that the higher the frequency of co-occurrence of two concepts in a sentence, the more significant the relation is.

The last step involves displaying the results. It is worth noting that only concepts that participate in relations are displayed. The results can be seen as small groups of concept maps rather than one large connected concept map.
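A simplified Python rendering of the same-sentence heuristic is shown below. Unlike Algorithm 2, which scans word positions and pairs a concept with the one immediately following it, this sketch greedily tokenizes each sentence into concepts (preferring compounds) and then counts every concept pair in the sentence:

    from collections import Counter

    def find_relations(sentences, concepts, compounds, min_r=7):
        # Two concepts are potentially related if they occur in the same
        # sentence; words forming a compound concept count as one concept.
        rel_freq = Counter()
        for words in sentences:
            found, i = [], 0
            while i < len(words):
                if i + 1 < len(words) and (words[i], words[i + 1]) in compounds:
                    found.append(words[i] + " " + words[i + 1])
                    i += 2
                else:
                    if words[i] in concepts:
                        found.append(words[i])
                    i += 1
            for a_idx, a in enumerate(found):
                for b in found[a_idx + 1:]:
                    if a != b:
                        rel_freq[tuple(sorted((a, b)))] += 1
        return {pair: f for pair, f in rel_freq.items() if f >= min_r}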

5.5 Development ideas

Concept maps are considered a graphical representation of one's knowledge in a certain domain, and ACMC is an application that aims at automatically constructing concept maps from text. The current implementation of ACMC does not offer a graphical user interface, which is a problem for anyone not familiar with ACMC. Creating a GUI that draws the extracted concepts and the relations between them would make it easier for the user to comprehend the concept map and would show the results in a visual form. Educationally, this might also help students identify which concepts are more relevant once they are given a list of concepts.

The approach used for extracting concepts in ACMC is based on the frequency of occurrence of a word. Other approaches could be integrated into ACMC, for instance, checking for concepts in chapter and section headings, in topic and introductory sentences of paragraphs, in emphasized words, and in table and figure captions. These could be implemented in ACMC as different techniques of extracting concepts. To further substantiate the significance of concepts, the extracted concepts could be passed through a series of tests to establish their relevance to the text. For instance, ACMC could check whether an extracted word appears in any headings or topic sentences, or is emphasized; such concepts could be considered more significant to the text than concepts appearing only in normal text.


In ACMC, compound concepts are considered as two consecutive words. This notion could be expanded by taking into account that compound concepts can consist of more than two words.

Only one measure for extracting relations (co-occurrence of concepts in the same sentence) has been implemented in ACMC. Other means of finding relations could be analyzed and implemented as well. Potential relations could be determined if concepts appear in the same paragraph as well as in the same sentence. The significance of relations could then be weighted such that a relation occurring within a sentence as well as within a paragraph would weigh more than a relation appearing only within a paragraph or only within a sentence.


Chapter 6

Tests and Experiments

In this chapter, we present the results obtained from the experiments performed on ACMC. We make comparisons between the ACMC-constructed concept maps and the manually constructed concept maps. We also present the results obtained from running the test data through Leximancer, a tool for automatic concept map construction.

6.1 Test cases

After the design and implementation of the ACMC application, several tests were made.

• Basic statistics: Given different thresholds, the following requirements were tested:

– The program extracts single-word concepts and their frequencies correctly.

– The program extracts compound concepts and their frequencies correctly.

– The program identifies singular and plural words correctly and combines their frequencies.

– The program extracts potential relations:

1. between single-word concepts
2. between a single-word concept and a compound concept
3. between compound concepts

• Comparison to human-drawn maps: We compared the results from human-constructed maps against the results obtained from ACMC.

• Comparison to Leximancer: We compared the results from the human-constructed maps and ACMC against the results from Leximancer. This test served to see whether there were differences among the different approaches applied to extract concepts and construct concept maps.

6.2 Data material

One of the areas of focus in testing was to see how ACMC would work with different types of learning material. Three different test data sets were used, and four different frequency thresholds were applied to each of them. The lowest frequency threshold used was 3, which in this chapter is sometimes referred to as 'All'. This was based on the assumption that sensible concepts would appear at least three times in the test data, while still pruning out the non-sensible words. It is worth mentioning that all the test data were in LaTeX format.

• Test data 1: A data mining book: the first three chapters of Knowledge Discovery in DB: The search for frequent patterns by Heikki Mannila and Hannu Toivonen. This learning material was written in a book format, with chapters, sections, sub-sections and full sentences explaining the concepts, a total of 37 pages. In this chapter, this text material is abbreviated as DM.

• Test data 2: Theoretical Foundations of Computer Science by Wilhelmiina Hämäläinen. The text material consisted of slides with bullets and lists and very few topic sentences, a total of 164 pages. In this chapter, this text material is abbreviated as TFCS.

• Test data 3: The first 92 pages of Scientific Writing material by Wilhelmiina Hämäläinen. This had a writing style similar to TFCS, but contained more topic sentences on each concept. In this chapter, this text material is abbreviated as SCIWRI.

• Human-constructed maps: We had hand-drawn concept maps of the above test data, drawn by the author; see Appendices F, G and H. These maps were used as reference maps against which to compare the automatically constructed concept maps. The concepts and relations in the human-constructed maps were regarded as "relevant".

Figure 6.1 shows the counts of all the concepts and relations obtained from the manually constructed concept maps for each test data set. From the DM book concept map, 61 concepts were counted, of which 38 were compound concepts and 44 were nouns; 73 relations were counted. The TFCS concept map contained 56 concepts, of which 34 were compound concepts and 35 were nouns; 61 relations were counted. From the SCIWRI human-constructed map, a total of 102 concepts were counted, of which 34 were compound concepts and 55 were nouns, together with 68 relations.

Figure 6.1: Manually extracted concepts.


6.3 Test measures

In determining how well ACMC works, we calculated precision, recall and F-measure. Precision and recall are related measures which capture different aspects of the comparison. In this context, precision is defined as the fraction of retrieved concepts that are relevant; in simpler terms, precision measures how well ACMC weeds out what is not wanted. Recall is defined as the fraction of relevant concepts that are retrieved; recall measures how well ACMC finds what is wanted. In many situations, a single measure combining precision and recall is appropriate for comparisons. In this context, the F-measure summarizes how well the ACMC concepts match the human-constructed map.

Precision is calculated as:

$Precision = \frac{|HC \cap PC|}{|PC|}$

Recall is calculated as:

$Recall = \frac{|HC \cap PC|}{|HC|}$

The F-measure is calculated as:

$F\text{-}measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}.$

Here, $HC$ stands for the set of concepts or relations found in the human-constructed concept maps and $PC$ stands for the set of concepts or relations produced by ACMC.
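Given the reference map and the produced map as sets, these measures take only a few lines of code; the example numbers below are illustrative, loosely echoing the DM counts reported in Section 6.4 (61 reference concepts, 103 produced, 32 shared):

    def precision_recall_f(human, produced):
        # Precision, recall and F-measure of a produced concept (or
        # relation) set against a human reference set.
        overlap = len(human & produced)
        precision = overlap / len(produced) if produced else 0.0
        recall = overlap / len(human) if human else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    # 61 reference concepts, 103 produced, 32 in common:
    human, produced = set(range(61)), set(range(29, 132))
    print(precision_recall_f(human, produced))  # (~0.31, ~0.52, ~0.39)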


6.4 Results

In this section, we present the results obtained from testing the different components of ACMC. We also present the results obtained from the human-constructed concept maps. Later in the chapter, we make comparisons between the ACMC-constructed and the manually constructed concept maps.

6.4.1 Overview of ACMC concept maps

The first step in the ACMC is to access and load the three text files needed. To test this component, we specified the three file names on the command line, and the ACMC loaded the files successfully.

We tested that ACMC was able to extract words from a given text, excluding the stop words, and count their frequencies. ACMC was able to identify plural and singular words in the text, counting their frequencies and merging them if both occurred in the text. Figure 6.2 shows a section of the output produced by ACMC: extracted concepts and their frequencies. ACMC was also able to extract potential relations, as shown by the sample of results in Figure 6.3.

Figure 6.2: Section of the results produced by the ACMC: Extracted words and their frequencies.


Figure 6.3: Section of the results produced by the ACMC: Potential relations and their absolute frequencies.

6.4.2 ACMC concept maps

Table 6.1 gives a summary of the results obtained after testing ACMC with the given test data. From the table, it can be seen that ACMC extracted 689 words with frequency ≥ 3, excluding stop words, from the DM book. Of the extracted concepts, 148 were compound words and 232 were found to be nouns. Of the words with frequency ≥ 10, 181 concepts were extracted, of which 19 were compound words and 102 were nouns. 103 extracted concepts had frequency ≥ 15, of which 10 were compound words and 68 were nouns. ACMC extracted 138 potential relations with frequencies ≥ 7.

From the TFCS data, the ACMC extracted 916 words, of which 277 were compound concepts and 244 were nouns. Of all the extracted concepts, 251 words, 30 compound words and 97 nouns had frequency ≥ 10, and 191 extracted concepts, 15 compound words and 67 nouns had frequency ≥ 15. The ACMC extracted 498 potential relations with frequency ≥ 7, of which 19 were found to be sensible relations.



Table 6.1: Concepts extracted by the ACMC.

When tested with the Scientific Writing material, 896 potential concepts were extracted. Of these, 147 were compound words and 321 were nouns. 236 of the extracted concepts had frequency ≥ 10, of which 5 were compound words and 121 were nouns. 161 extracted concepts, 1 compound word and 72 nouns had frequency ≥ 15, and 73 relations were extracted.

It was observed that the more sensible single word concepts, compound concepts and relations had higher frequencies. The highlighted rows in Table 6.2 show the most frequent sensible and non-sensible concepts and relations.

Comparison to Human drawn maps

A comparison between the manually constructed and the ACMC constructed concept maps was made, and the results are summarized in Table 6.3. From the DM book test data, 32 concepts, of which 14 were compound concepts and 24 were nouns, were found to appear in both the manually constructed and the ACMC constructed concept maps, with no relations in common. The TFCS test data produced 31 concepts, of which 13 were compound concepts and 15 were nouns, appearing in both maps, again with no common relations.


Table 6.2: List of the most sensible and non-sensible extracted words.



Table 6.3: Concepts, compound concepts, nouns and relations that appear in both manual and ACMC constructed maps.

For the SW test data, 62 concepts appeared in both the manually and the ACMC constructed concept maps; of these, 13 were compound words and 32 were nouns, and no potential relations were shared.

Precision, Recall and F-Measure

The precision, recall and F-measure of the extracted concepts, compound words and relations for each test data set were calculated at different thresholds. The results are presented in Table 6.4.
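As a sketch of how such a threshold sweep could be performed (the names freq and human_concepts are hypothetical placeholders for the ACMC frequency table and the human-map concept set; the measures are those defined in Section 6.3):

    def measures_at_threshold(freq, human_concepts, threshold):
        # PC: concepts whose frequency reaches the threshold; HC: human-map concepts.
        pc = {c for c, f in freq.items() if f >= threshold}
        hc = set(human_concepts)
        overlap = len(hc & pc)
        precision = overlap / len(pc) if pc else 0.0
        recall = overlap / len(hc) if hc else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall > 0 else 0.0)
        return precision, recall, f_measure

    # Evaluate at the thresholds used in Table 6.4:
    # for t in (3, 10, 15):
    #     print(t, measures_at_threshold(freq, human_concepts, t))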

From Table 6.4, it can generally be observed that precision is lower than recall, with a few exceptions. The exceptions occurred when calculating precision and recall of compound words with frequency ≥ 7, apart from the TFCS test data, where the precision was higher for compound words with frequency ≥ 10. It can also be observed that, in most cases, the precision of extracted concepts increased while the recall decreased as the frequency threshold increased. A similar trend was observed with compound words and nouns.

The highest values of precision and recall were observed within different thresholds and for the different components tested (extracted concepts, compound words, nouns and relations).

Generally, SW has the highest precision (1) for compound words with frequency ≥ 15, followed by 0.6 for compound words with frequency ≥ 10 of the same test data, and DM (0.5) for compound words with frequency ≥ 15. DM has the highest recall (0.6167) for extracted concepts with frequency ≥ 3, followed by TFCS (0.5818) for nouns with frequency ≥ 3 and DM (0.4211) for compound words with frequency ≥ 3.

The highest precision values of the different components were observed at frequency ≥ 15, with relations being an exception.


Table 6.4: Precision, recall and F-measure of the different test data.



The highest recall values of all the components tested were observed at frequency ≥ 3.

For two of the test data sets, DM and TFCS, the F-measure tended to increase with the frequency threshold for extracted words and compound words, except at the highest threshold, where the F-measure dropped. For the SW test data, the F-measure decreased as the frequency threshold increased. For relations, the F-measure decreased with increasing threshold for all the test data. The highest F-measure (0.2857) was observed for the compound words of the DM test data.

The frequency threshold affected the results in that, as the threshold increased, a larger fraction of the retrieved concepts was relevant, while the total number of retrieved concepts decreased.

The results depict better performance of the ACMC in retrieving compound concepts from the SW test data and in extracting relevant concepts from the DM test data.

The low precision and recall values observed for the relations component show that the ACMC did not perform well in extracting relevant relations from the test data.

From the measures observed above, we can conclude that the ACMC works better with the DM test data than with the TFCS or SW test data. This conclusion is based on the observation that the DM test data has a better precision/recall combination and F-measure: it produced reasonably high precision and recall values, even though these values fell within different thresholds. The SW test data produced the highest precision value (1), but did not offer a suitable recall to go with it. At this point, the best frequency threshold to be used in the ACMC cannot be determined, as the highest precision and recall values fall within different frequency thresholds.

Comparison to Leximancer

Each of the test data sets was run through Leximancer, and the results can be seen in Figure 6.4. Leximancer extracted 116 concepts from the DM book, 154 concepts from the TFCS test data and 104 concepts from the Scientific Writing test data. See appendices A, B and C for the results obtained from Leximancer.

Figure 6.4: Concepts extracted from Leximancer.

Figure 6.4 also summarizes the comparisons made between the manual concept maps and the Leximancer concept maps. It was observed that 18 concepts from the DM book appeared in both the manual concept map and the Leximancer concept map. The same number of concepts was observed for the TFCS material. 19 concepts appeared in both the manually and the Leximancer constructed concept maps for the Scientific Writing material.

Comparing the ACMC and Leximancer constructed concept maps led to the values displayed in Figure 6.4. The DM book and TFCS test data had 85 and 101 concepts, respectively, appear among both the Leximancer and the ACMC extracted concepts. 101 concepts from the Scientific Writing test data appear in both the ACMC and the Leximancer constructed concept maps.

Leximancer can be seen as more of a graphical tool than a statistical one; therefore, the relations extracted by Leximancer could not be compared to the relations extracted by the ACMC or those in the human maps.

6.5 Discussion

The experimental results showed a difference in the number of concepts extracted from the different test data. This can be attributed to the fact that the test data had different contents, coming from different courses, and were of different sizes. Another contributing factor is writing style: each test data set delivered its contents in a different style.


It was observed that, comparably, when a higher threshold of ≥ 15 was used, more sensible concepts were produced. This means that a large number of sensible concepts lay at frequencies ≥ 15. It was also observed that a slightly lower threshold, ≥ 10, produced sensible compound concepts. We could explain these observations with the phrase: ”the more a concept appears in the text, the more relevant it is”. Based on the heuristics used, concepts that are significant in the given text material tend to occur frequently in it. Other concepts may occur in the text, but less frequently; these concepts are either less significant or less inclusive. More inclusive concepts occur frequently in a given text, as they encompass the general ideas of the text and are therefore mentioned often. Less inclusive concepts, on the other hand, cover specific areas of the text and hence occur less frequently.

The results showed that a large number of the extracted relations had frequencies ≤ 3. A small number of extracted relations, those with frequencies ≥ 3, were seen as sensible relations. As the extraction of potential relations between concepts relied on the extracted concepts, non-sensible concepts produced non-sensible relations. Since the direction of a relationship is not indicated by the ACMC, relations between concepts such as (’automaton’, ’finite automaton’) and (’finite automaton’, ’automaton’) are identified as one relationship.

It is important to note that the basic method used by the ACMC to extract concepts and relations was based on term occurrence; no syntactic analysis or auxiliary ontologies were used. The ACMC was not able to identify features such as equations, tables, formulas and algorithms, and consequently could not differentiate the text in these features from the normal running text of the material. This could explain why the ACMC produced some non-sensible concepts and relations.

Here, we refer to concepts and relations that are not relevant to the text material as non-sensible. For instance, parts of equations, misspelt words, non-English words (as some texts contained Finnish words) and verbs were considered as non-sensible.
