6.1 Statistical analysis

6.1.5 Adapting the smaller data to analysis

In the previous paragraphs we saw that omitting the fourth class did indeed improve the suitability of the data for ANOVA analysis. However, the data as a whole is not yet good enough for a two- or three-way ANOVA analysis – only one variable fulfills all of the necessary assumptions. It therefore still seems prudent to work with the data further to see whether even stronger suitability can be achieved.

Performing the same battery of power transformations on this smaller set as we did on the original data, we can check the tests of normality (Shapiro-Wilk) and the tests of homogeneity of variance (Levene) to see whether the situation has improved. A summary of the results is presented in appendix 7.
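The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact ladder of transformations used in appendix 7 is not listed here, so the set in TRANSFORMS (including the logarithm) is hypothetical, the function names are invented for this example, and the sample values are made up. Only the mean-centred Levene statistic is computed; the Shapiro-Wilk test and the p-values are left to a statistics package.

```python
import math

# Hypothetical ladder of power transformations; the exact set used in
# appendix 7 is an assumption. All of them require strictly positive data.
TRANSFORMS = {
    "quadratic": lambda x: x ** 2,
    "original": lambda x: x,
    "square root": math.sqrt,
    "logarithm": math.log,
    "reciprocal": lambda x: 1 / x,
    "reciprocal quadratic": lambda x: 1 / x ** 2,
}


def levene_w(groups):
    """Mean-centred Levene statistic W for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group mean.
    devs = []
    for g in groups:
        mean = sum(g) / len(g)
        devs.append([abs(x - mean) for x in g])
    dev_means = [sum(d) / len(d) for d in devs]
    grand = sum(sum(d) for d in devs) / n_total
    between = sum(len(d) * (m - grand) ** 2 for d, m in zip(devs, dev_means))
    within = sum((z - m) ** 2 for d, m in zip(devs, dev_means) for z in d)
    return (n_total - k) / (k - 1) * between / within


def levene_by_transform(groups):
    """Return {transformation name: Levene W} for the given groups."""
    return {
        name: levene_w([[f(x) for x in g] for g in groups])
        for name, f in TRANSFORMS.items()
    }


# Made-up repair times for two groups; not data from the study.
example = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
for name, w in levene_by_transform(example).items():
    print(f"{name:22s} W = {w:.3f}")
```

A smaller W means the groups have more similar spread under that transformation. To turn W into a significance value it is compared against an F distribution with k - 1 and N - k degrees of freedom.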

The Year variable does not really need improvement, as it is already acceptable in its original form, but we can observe that it shows its strongest results on the tests of normality under the square root transformation. In terms of homogeneity of variance, almost all transformations produce acceptable results (the quadratic transformation being the exception), but the strongest results are obtained under the reciprocal quadratic transformation.

The Class variable, on the other hand, is a more complicated case. The third class, the mid-price v-motor models, simply cannot be made more normal in its distribution by applying any of these power transformations. Under some of the transformations, the results of the other two classes are improved, however. The homogeneity of variance is acceptable at the 5% level of significance in the original form only.

Finally we have the Region variable. According to the tests of normality, this variable is normally distributed under most of the transformations, including the original form. However, the tests of homogeneity of variance indicate that it does not really show equality of variance between groups under any transformation. This is at least true if we strictly adhere to the 5% level of significance; a few of the reciprocal transformations do show results that get close to this level. It may be possible to perform an ANOVA analysis on this variable if we look more closely at the robust tests of equality of means, Brown-Forsythe and Welch.
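As an illustration of what a robust test of equality of means does, the following is a minimal sketch of Welch's one-way statistic, which weights each group by its own variance instead of pooling the variances. The function name and the sample values are invented for this example; the sketch returns only the statistic and its degrees of freedom, not a p-value.

```python
def welch_anova(groups):
    """Welch's one-way test statistic; returns (F, df1, df2)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [sum(g) / n for g, n in zip(groups, sizes)]
    # Unbiased sample variance of each group.
    variances = [
        sum((x - m) ** 2 for x in g) / (n - 1)
        for g, m, n in zip(groups, means, sizes)
    ]
    # Each group is weighted by n / s^2, so noisy groups count for less.
    weights = [n / v for n, v in zip(sizes, variances)]
    w_sum = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_sum
    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum((1 - w / w_sum) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Made-up values for three groups; not data from the study.
f_stat, df1, df2 = welch_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"Welch F({df1}, {df2:.1f}) = {f_stat:.3f}")
```

Because the second degree of freedom is estimated from the group variances, it is generally not an integer and shrinks when the variances differ strongly.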

Further editing of the data

As we could see, the transformations resulted in certain improvements in some cases: the regional variable should now be passable for ANOVA analysis, as is.

The class variable still does not fulfill the assumptions of such an analysis, however.

So is there any way we could continue to improve on the data and fulfill the assumptions?

Yes and no. The class variable showed clear indications of being an artificial variable. In particular, the division between classes two and three – mid-price in-line motor models and mid-price v-motor models – seems ill-fitting, and the two classes could arguably be combined to form a more naturally distributed subgroup. Doing so would of course not affect the results of the other two independent variables, year and region, but it might improve the class variable.

However, there would be some negative effects from all this too. First, in real-world terms we would be combining two classes of cars that simply do not mix well, unless we seriously re-evaluate all models in the two old classes. It is probably not wise to compare the cheapest in-line mid-price cars with the most expensive v-engine models, unless we are prepared to bend the criteria on which we made the selection. We would then have to reconfigure the classes more thoroughly and discard the cheapest and most expensive cars from each original class to form a believable new entity.

Secondly, we would knowingly be ignoring the fact that there is a systematic difference in the recorded time measurements between the two classes. Some of the sub-assemblies are strongly affected by the motor type: the exhaust manifold, the alternator, the removal and placement of the engine itself, and the type of transmission (which is now chosen on the basis of motor/class). While these differences are taken into consideration by the current class division, they would simply be hidden in the new class. It is a small difference, but nevertheless a systematic error.

Besides the option of combining the mid-price classes, one could also see whether manual deletion of certain outliers would help improve the normality of the class variable. This was in fact tested: having printed a list of the lowest and highest outliers in the class variable and deleted 10–15 possible candidates, the data was explored again. Unfortunately, this simple method did not produce any positive results in terms of normal distribution. The homogeneity of variance did of course improve somewhat, but this was not strictly necessary.

To go further into the issue and start deleting specific observations that hinder the data from displaying a normal distribution would not be a scientifically trustworthy option. Transformation of the data is accepted since it does not change the internal relations between observations, but manipulating the data by consciously deleting observations simply because they do not fit one’s expectations is unethical research methodology.

Nevertheless, we could probably improve the statistical validity of the data by recombining the two mid-price classes, leaving us with only the low-price cars and the mid-price cars. Furthermore, when looking at the region variable, it is obvious that the European group stands out quite strongly as a candidate to be dropped from the analysis. The European group is only a fraction of the size of the Asian and North American groups, and its presence alone makes the comparison less reliable.

These are both valid choices – they would leave us with a smaller data set (about 200 models) but would produce a statistically more reliable comparison. But narrowing the scope of the comparison so much would be unsatisfying – better to see first what the (admittedly less dependable) results of the wider and more international data set can tell us. Cutting the data further can be an option for later.
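For reference, the two reductions discussed above (merging the mid-price classes and dropping the European group) amount to a simple recoding step. The sketch below assumes a hypothetical record layout: the field names car_class and region, the numeric class codes (1 for low-price, 2 and 3 for the two mid-price classes) and the region labels are all invented for this example and do not come from the actual data file.

```python
def reduce_dataset(rows):
    """Drop the European group and merge the two mid-price classes.

    Field names and codes are hypothetical, not taken from the study.
    """
    reduced = []
    for row in rows:
        if row["region"] == "Europe":
            continue  # the small European group is left out
        new_row = dict(row)  # copy, so the original records stay intact
        # Classes 2 and 3 (mid-price in-line and v-motor) become one class.
        new_row["car_class"] = "low" if row["car_class"] == 1 else "mid"
        reduced.append(new_row)
    return reduced


# Made-up records; not data from the study.
sample = [
    {"model": "A", "car_class": 1, "region": "Asia"},
    {"model": "B", "car_class": 2, "region": "North America"},
    {"model": "C", "car_class": 3, "region": "Asia"},
    {"model": "D", "car_class": 2, "region": "Europe"},
]
for row in reduce_dataset(sample):
    print(row["model"], row["car_class"], row["region"])
```

Working on copies keeps the wider data set available, which matches the decision to analyse it first and to cut it down only later if needed.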

6.1.6 2nd ANOVA analysis of the Year variable

In appendix 8 we can see the ANOVA analysis for the smaller set of data, with the fourth class omitted. We perform this analysis on the untransformed data set. In appendix 7 we went through the normal power transformations, and while the original data did not display the very highest significance values on the tests of normality and homogeneity of variance, it did get completely acceptable results on both. And since the untransformed figures still have a connection to real-life applications, it is best to use them.

Since both assumptions are fulfilled, the normal test statistic will work. On this smaller set of data, too, the result is a conclusive rejection of the null hypothesis: the year groups have different means at a statistically significant level. This is not surprising, as the ANOVA analysis on the larger but transformed data set showed the same result. The only difference can be seen in the post-hoc tests – the different groupings now seem more conclusive. The first three years (1990, 1995, 2000) are similar enough in their means that they form one unit, while the later two years (2005 and 2010) form two separate groups. Repair times are conclusively going up with time, and even more so in later years.

ANOVA: F(4, 251) = 20.584, Sig. = 0.000 (p < 0.001)
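To make the reported statistic easier to read: the degrees of freedom are k - 1 and N - k, so F(4, 251) corresponds to five year groups and, assuming no missing observations, 256 cars. A minimal sketch of how the one-way F statistic is formed is given below. The function name and the sample values are invented for this example, and the sketch does not reproduce the figures in appendix 8.

```python
def one_way_anova(groups):
    """Classical one-way ANOVA; returns (F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up values for three groups; not data from the study.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A large F means that the variation between the group means is large relative to the variation within the groups, which is what leads to the rejection of the null hypothesis here.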