
Research methods

3.3 Data set preparation

3.3.3 Data transformation

For reasons outlined below, both the dependent and the independent variables had to be transformed before proceeding with the data analysis. With regard to the dependent variables, their residuals followed strongly positively skewed distributions, which are characterised by a long right tail, and their spread changed systematically with the values of the dependent variable, a condition known as heteroscedasticity. The Jarque-Bera test, established by Jarque and Bera (1980), was employed to test whether the original data sample had the same skewness and kurtosis as the normal distribution, which are equal to 0 and 3 respectively, under the null hypothesis that the residuals are normally distributed [22]. Specifically, the Jarque-Bera test statistic is defined as:

\[
JB = \frac{N}{6}\left( W^{2} + \frac{(K - 3)^{2}}{4} \right) \tag{3.27}
\]
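As an illustration only, equation (3.27) can be evaluated directly from the sample moments. The sketch below assumes that N is the number of observations, W the sample skewness and K the sample kurtosis, both computed from biased (divide-by-N) central moments. The function name `jarque_bera` and the toy residuals are hypothetical and are not taken from the data set of this research.

```python
def jarque_bera(values):
    """Return (JB, W, K) for a sequence of residuals.

    Assumption: W and K are the skewness and kurtosis computed from
    biased (divide-by-N) central moments.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    w = m3 / m2 ** 1.5  # skewness, 0 for a normal distribution
    k = m4 / m2 ** 2    # kurtosis, 3 for a normal distribution
    jb = n / 6.0 * (w ** 2 + (k - 3.0) ** 2 / 4.0)
    return jb, w, k


# Hypothetical toy residuals, symmetric around their mean.
jb, w, k = jarque_bera([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"JB={jb:.4f}  W={w:.4f}  K={k:.4f}")
```

Large values of the statistic arise when the skewness departs from 0 or the kurtosis departs from 3, which is what leads to rejection of the null hypothesis of normality.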

The execution of the test on the original dependent variables provided the results shown in the following table, using the variables for the year 2014 as an example:

Variable    χ²        DF    p-value
RHIOAP14    384.58    2     < 2.2e-16
RHIDAP14    52.132    2     4.783e-12
RHEOAP14    29.68     2     3.591e-07
RHEDAP14    39.349    2     2.854e-09

Table 3.2: Jarque-Bera test for the original dependent variables (2014)

As can be seen, the results strongly reject the null hypothesis of normally distributed residuals for every dependent variable. Together with the heteroscedasticity described above, this represents a problem for linear regression analysis with the ordinary least squares method, because inference with this method assumes homoscedastic and normally distributed residuals, and it therefore needed to be resolved before continuing with the data analysis.
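For reference, under the null hypothesis the Jarque-Bera statistic is compared with a chi-squared distribution with 2 degrees of freedom (the DF column of Table 3.2), whose upper-tail probability has the closed form exp(-x/2). The sketch below uses this identity to recompute p-values from the reported statistics. The function name `jb_p_value` is hypothetical, and the software originally used to run the test is not stated in this section.

```python
import math


def jb_p_value(statistic):
    """Upper-tail probability of a chi-squared(2) variable: exp(-x / 2)."""
    return math.exp(-statistic / 2.0)


# Statistics reported in Table 3.2 for the 2014 dependent variables.
reported = {
    "RHIOAP14": 384.58,
    "RHIDAP14": 52.132,
    "RHEOAP14": 29.68,
    "RHEDAP14": 39.349,
}
for name, chi2 in reported.items():
    print(f"{name}: chi2={chi2}, p={jb_p_value(chi2):.4g}")
```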

To address this, the dependent variables were log-transformed by taking their natural logarithms, in order to obtain residuals that were approximately symmetrically distributed and to remove the systematic change in their spread, thereby approximately satisfying the assumption of homoscedasticity so that the analysis could be conducted correctly. This type of transformation also permits the execution of all the statistical tests that depend upon the assumption of normality of the residuals.
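A minimal sketch of why the natural logarithm helps with a long right tail, using hypothetical values rather than the data of this research: the raw values below are spaced multiplicatively, so they are positively skewed, while their logarithms are evenly spaced and symmetric. The helper name `skewness` is illustrative.

```python
import math


def skewness(values):
    """Skewness from biased (divide-by-N) central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positive values with a long right tail (not thesis data).
raw = [math.exp(z) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
logged = [math.log(v) for v in raw]  # natural-log transformation

print(f"skewness before: {skewness(raw):.3f}")
print(f"skewness after:  {skewness(logged):.3f}")
```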

The natural logarithmic transformation is often used in the fields of statistical analysis and social sciences since it is a simple process and, as Gelman and Hill (2006) suggested, “coefficients on the natural-log scale are directly interpretable as approximate proportional differences: with a coefficient of 0.06, a difference of 1 in x corresponds to an approximate 6% difference in y, and so forth” [17, 60-61]. For the data analysis in this research, the letter “L” at the end of the name of a variable indicates that this logarithmic transformation has been applied. The following table reports the results of the Jarque-Bera test for the log-transformed dependent variables, using the variables for the year 2014 as an example:

Variable     χ²        DF    p-value
RHIOAP14L    1.4939    2     0.4738
RHIDAP14L    4.1254    2     0.1271
RHEOAP14L    2.0297    2     0.3625
RHEDAP14L    2.2686    2     0.3216

Table 3.3: Jarque-Bera test for the log-transformed dependent variables (2014)

These outcomes indicate that the null hypothesis of normally distributed residuals can no longer be rejected for any of the log-transformed dependent variables, which supports the reliability of the results of the data analysis. This is also confirmed by histograms, which show the distribution of a continuous variable and are used to assess whether the values of each dependent variable are normally distributed, and by probability plots, which plot the residuals against the expected order statistics of the standard normal distribution and indicate negative or positive skewness through curvature with downward or upward concavity, respectively.
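The construction of a normal probability plot described above can be sketched as follows. This is an illustrative sketch, not the plotting routine used for the figures in this research: the function name `qq_points` and the plotting positions (i + 0.5) / n are assumptions, and other conventions for approximating the expected order statistics exist.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each ordered residual with a standard normal quantile.

    Assumption: plotting positions (i + 0.5) / n for i = 0..n-1.
    """
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(residuals)
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals; each tuple is (theoretical quantile, observed value).
for q, r in qq_points([0.4, -1.2, 0.1, 2.3, -0.3]):
    print(f"{q:+.3f}  {r:+.3f}")
```

Plotting the second element of each pair against the first gives the Q-Q plot; points from normally distributed residuals fall close to a straight line.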

The following histograms and probability plots (Q-Q plots) illustrate the effects of the logarithmic transformation on the distribution of the dependent variables and their residuals, using the variable RHIOAP14 as an example:

Figure 3.8: Logarithmic transformation of the dependent variable RHIOAP14

Figure 3.9: Q-Q plots of residuals for RHIOAP14 and RHIOAP14L

The histograms illustrate that the original dependent variable followed a positively skewed distribution, while the log-transformed one is approximately normally distributed. Moreover, the Q-Q plots show that the residuals of the original dependent variable followed a positively skewed distribution, as indicated by the upward concavity in the first plot, while those of the log-transformed variable are approximately normally distributed, as indicated by their approximate adherence to a straight line at a 45° upward angle in the second plot. The logarithmic transformation is monotonic and therefore preserves the ordering of the observations; the interpretation of the analysis results simply needs to follow the guideline outlined above for the natural logarithmic transformation.
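The guideline quoted from Gelman and Hill can be checked numerically. For a model with a log-transformed dependent variable, a coefficient b implies that y is multiplied by exp(b) for a difference of 1 in x, which is close to a proportional difference of b when b is small. The function name `percent_difference` below is hypothetical, and the second coefficient is an arbitrary value chosen for comparison.

```python
import math


def percent_difference(coefficient):
    """Exact percentage difference in y implied by a log-scale coefficient."""
    return (math.exp(coefficient) - 1.0) * 100.0


for b in (0.06, 0.30):
    print(f"b={b:.2f}: approx {b * 100:.1f}%, exact {percent_difference(b):.1f}%")
```

The approximation is good for small coefficients and deteriorates as they grow, in which case the exact expression is the safer one to report.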

Concerning the independent variables, a mean-centring procedure was executed to reduce the collinearity between them, avoiding inflated multicollinearity indicators that could have wrongly called into question the selection of independent variables for the statistical models. The procedure involved subtracting the mean from the values of each independent variable, which centred them around zero. The procedure affected neither the inherent meaning of the data nor any other characteristic of the independent variables, such as the standard deviation and skewness. For the data analysis in this research, the letter “C” at the end of the name of an independent variable indicates that this mean-centring procedure has been applied. The following histograms depict the results of the mean-centring procedure on the distribution of the independent variables, using the variable BedOAR14 as an example:

(a) Distribution of BedOAR14 (horizontal axis: x; vertical axis: Frequency)

(b) Distribution of BedOAR14C (horizontal axis: x; vertical axis: Frequency)

Figure 3.10: Mean centring of the independent variable BedOAR14
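The mean-centring step can be sketched in a few lines. The values below are hypothetical and are not taken from BedOAR14; the function name `mean_centre` is illustrative. The sketch shows that the mean moves to zero while the standard deviation is unchanged. The same holds for the skewness, since it depends only on deviations from the mean.

```python
from statistics import fmean, pstdev


def mean_centre(values):
    """Subtract the sample mean from every value."""
    mean = fmean(values)
    return [v - mean for v in values]


# Hypothetical values for an independent variable (not thesis data).
x = [12.0, 18.0, 21.0, 25.0, 44.0]
x_c = mean_centre(x)

print(f"mean: {fmean(x):.2f} -> {fmean(x_c):.2f}")
print(f"standard deviation: {pstdev(x):.4f} -> {pstdev(x_c):.4f}")
```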

The section “Data transformation” of Appendix A, “Data set preparation”, contains additional histograms and Q-Q plots that illustrate the effects of the logarithmic transformation on the distribution of all the dependent variables and their residuals, as well as the outcomes of the mean-centring procedure on the distribution of all the independent variables, for all the years taken into account for the data analysis.

Chapter 4