Methodological considerations

The key concept of data mining is discovery, commonly defined as “detecting something new”. The actual data mining task is the exploration of large amounts of data, described by several factors, to extract interesting patterns such as clusters, unusual records, or dependencies. These patterns can then be seen as a kind of summary of the input data and used in further analysis. A widespread method in data mining is the classification tree, which aims at predicting the value of a target variable on the basis of several input variables. The tree is constructed algorithmically (using a computer package) by computing, for each factor to be considered, the information gain (with respect to the target variable) obtained by splitting the initial population into two groups at some threshold value. Once the splitting value that maximizes the information gain has been found, the program explores the next factor, until all factors have been tested. The population is then split according to the factor (and the split value) corresponding to the highest information gain found. At this point, the full process is repeated (including the factor already used for the first split) for each of the resulting subgroups, and so on until either (i) each of the resulting subgroups contains only individuals having the same value of the target variable, or (ii) further splitting does not yield a significant information gain. The best way to illustrate the method is to show how it works on data, as we shall do in the next section.
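Before turning to the data, the procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not the computer package used for the analysis: it assumes numeric factors, binary splits at midpoints between observed values, Shannon entropy as the impurity measure, and a hypothetical `min_gain` parameter standing for the “significant information gain” of stopping rule (ii). The function names and the toy data are likewise assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (gain, threshold) of the best binary split of one numeric factor."""
    parent = entropy(labels)
    best_gain, best_cut = 0.0, None
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < cut]
        right = [y for x, y in zip(values, labels) if x >= cut]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_gain, best_cut


def build_tree(rows, labels, factors, min_gain=0.01):
    """Recursively split `rows` (dicts mapping factor name -> number)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping rule (i): every individual has the same target value.
    if len(set(labels)) == 1:
        return majority
    # Test every factor, including those already used higher up in the tree.
    candidates = []
    for f in factors:
        gain, cut = best_threshold([r[f] for r in rows], labels)
        candidates.append((gain, f, cut))
    gain, factor, cut = max(candidates, key=lambda c: c[0])
    # Stopping rule (ii): no split yields enough information gain.
    if cut is None or gain < min_gain:
        return majority
    left = [(r, y) for r, y in zip(rows, labels) if r[factor] < cut]
    right = [(r, y) for r, y in zip(rows, labels) if r[factor] >= cut]
    return {
        "factor": factor,
        "threshold": cut,
        "left": build_tree([r for r, _ in left], [y for _, y in left], factors, min_gain),
        "right": build_tree([r for r, _ in right], [y for _, y in right], factors, min_gain),
    }


def predict(tree, row):
    """Follow the splits down to a leaf and return its class."""
    while isinstance(tree, dict):
        side = "left" if row[tree["factor"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree


# Toy data, invented for illustration only (not the study's records).
toy_rows = [
    {"T2": 10, "prc": 55},
    {"T2": 12, "prc": 65},
    {"T2": 18, "prc": 50},
    {"T2": 20, "prc": 70},
]
toy_labels = [0, 0, 1, 1]
toy_tree = build_tree(toy_rows, toy_labels, ["T2", "prc"])
print(toy_tree)  # {'factor': 'T2', 'threshold': 15.0, 'left': 0, 'right': 1}
```

On the toy data a single split on T2 separates the two classes completely, so both subgroups are pure and rule (i) stops the recursion at once. With real data the choice of `min_gain` governs how deep the tree grows.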

Classification tree construction

Our data cover 162 undergraduate math students at the University of Torino, all enrolled in 2010/11. We shall first consider the number of passed exams (within the first year) as the target variable. In the classification tree construction, the target variable should take as few distinct values as possible, to avoid overfitting. Therefore, we first had to find a reliable way to “count” the earned credits, and then to split the range of values at a significant threshold, so as to obtain only two classes for the target variable. Following the 1999 Bologna Accords, European university courses are described in terms of ECTS credits.

One ECTS corresponds, in principle, to 25 hours of study; each academic year includes 60 ECTS. In previous studies on the undergraduate math curriculum at the University of Torino, we observed that a critical ECTS threshold in the first year is 21: students earning fewer than 21 ECTS in the first year very seldom complete the degree. Accordingly, in the subsequent analysis we shift from a description of what happened (number of ECTS earned) to a prediction of what is likely to happen (career). Thus, writing ECTS1 for the number of ECTS earned in the first year, we say that ECTS1≥21 “predicts success” and, conversely, that ECTS1<21 “predicts dropout”. At present, we also know the list of second-year students in 2011/12, and therefore the actual dropout incidence after the first year (a number of students, in fact, give up their studies at a later stage).

We apply the classification tree method to single out which variables “characterize” the two groups (ECTS1<21, ECTS1≥21). We have at our disposal up to 27 input variables concerning: personal information that can be read in terms of social aspects (for instance, living in a big or in a small town, being a commuter, and so on); psychological traits and motivations (as they emerged from the answers to the “affective” part of the test); data from the students’ previous career (diploma grades and type); and the performance in the non-selective entrance test (TARM). For each student, we also know all scores, credits and examination data for the first-year university courses, but in connection with dropout we considered only the total amount of ECTS obtained in scored exams. The construction of the classification tree is controlled by a number of parameters, such as the list of factors to be used and the minimum information gain to be considered for a split. The “best” model should be a compromise between the maximum overall predictive power and the minimum number of factors and splits needed (a fully predictive tree with too many nodes is likely to be overfitted).

Figure 1 shows a classification tree which gives a correct prediction rate of 92%, using only 9 factors. The variable yielding the greatest information gain is T2, the score in the second part of the test TARM (mostly assessing the comprehension of texts taken from math and physics textbooks, in Italian and in English): the first part of TARM, T1, is the same for all the undergraduate courses in the Faculty of Sciences and assesses basic mathematical skills, while T2 is considered as “specific” for the math curriculum.

Figure 1. Classification tree with ECTS1 as predicted variable.

The T2 score ranges from 0 to 30, and the split value determined by the algorithm (14.5) shows that the test was well balanced. Students on the right branch (scoring more than 14.5) are subsequently discriminated by “prc”, that is, the perceived sense of responsibility (Savickas et al., 2009): if the latter is not very high (prc<60.44), then the next split is on the factor smt2 (possibility to observe and imitate effective models). If smt2<62.95, the tree predicts “success” (ECTS1≥21).

If prc is very high, instead, the variable adp4 (the ability to establish positive relationships and cooperate with others) intervenes. Notice that all measures of affective factors have been rescaled so that 50±10 corresponds to the mean ± one standard deviation (observed for a suitable reference population).
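The rescaling mentioned above can be written as a one-line linear transformation. The sketch below assumes that the mean and standard deviation of the reference population are available; the function name and the numbers used in the example are hypothetical and do not come from the study.

```python
def rescale(raw_score, ref_mean, ref_sd):
    """Map a raw score so that the reference mean becomes 50 and one SD becomes 10."""
    return 50 + 10 * (raw_score - ref_mean) / ref_sd


# Hypothetical reference population with mean 3.0 and standard deviation 0.5.
print(rescale(3.0, 3.0, 0.5))  # 50.0 (exactly at the reference mean)
print(rescale(3.5, 3.0, 0.5))  # 60.0 (one standard deviation above the mean)
```

On this scale, a threshold such as prc<60.44 reads as roughly one standard deviation above the reference mean.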

The digits 0 and 1 at the bottom of terminal branches mean that the tree prediction for that branch is ECTS1<21 or ECTS1≥21, respectively. For each terminal group, we have indicated how many individuals of the original population are correctly classified (“T”), and incorrectly classified (“F”) by the tree.
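The correct prediction rate of a tree follows directly from these “T” and “F” counts: it is the total of the “T” counts divided by the size of the population. A minimal sketch, in which the leaf counts are invented for illustration and are not those of Figure 1:

```python
def correct_prediction_rate(leaves):
    """Fraction of individuals correctly classified, given (T, F) counts per terminal group."""
    correct = sum(t for t, _ in leaves)
    total = sum(t + f for t, f in leaves)
    return correct / total


# Hypothetical (T, F) counts for a tree with three terminal groups.
example_leaves = [(8, 2), (9, 1), (10, 0)]
print(correct_prediction_rate(example_leaves))  # 0.9
```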

Going back to the root branching on T2, the variable with the greatest information gain on the left branch (i.e., for students who scored less than 15 in T2) is the undergraduate curriculum: “MAT” (traditional math curriculum) versus “MFA” (applied math for finance and insurance). Here, we are not regarding the choice between the two curricula as an achievement factor; however, this datum should be included because reaching 21 ECTS may have a different significance in the two curricula. It turns out that the difference affects only students with a low T2 score. Among these, MFA students are further classified by the variables adp2 (inclination to consider oneself responsible for one’s own professional future) and st4 (writing ability). As regards MAT students with a low T2 score, the variable smt2 again plays a fundamental role.