2.6 Methodological background

2.6.4 Choosing the variables

When constructing a hedonic price model for housing markets, it is common to classify the variables into groups such as environmental, locational, socio-economic, and structural variables. The choice of variables reflects the aim and extent of the research. In several studies, however, the prevalent factor behind the choice seems to be the available data sources rather than theoretical considerations.

Choosing the individual variables is nevertheless problematic. As the variables are chosen from the available data, there are problems connected to the limitations of data sources and the difficulty of obtaining reliable data, as well as problems related to the definition, measurement, and quantification of variables such as the environment or the socio-economic structure of the neighborhood. (Laakso 1997, 46)

Another problem is multicollinearity, which was already mentioned in chapter 2.6.3 as one of the assumptions behind OLS. Some correlation between explanatory variables always exists in regression analysis, but if the correlation is too strong it leads to multicollinearity. A multicollinearity problem may appear if the correlation between explanatory variables exceeds 0.9. However, it is not always possible to detect multicollinearity problems from the pairwise correlations between variables alone. One way to measure multicollinearity is the Variance Inflation Factor (VIF). (KvantiMOTV 2003) A VIF value of 10 indicates that the variance of the specific regression coefficient is 10 times higher than it would have been if the variable did not have a strong dependency with the other independent variables in the model. In the literature, 10 has often been used as the limiting value above which multicollinearity problems are considered to occur.
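The VIF described above can be sketched as follows. The computation regresses each explanatory variable on the remaining ones and applies VIF_j = 1 / (1 - R_j^2); the data below is purely hypothetical, chosen so that two columns are almost collinear.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (n x k).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (with an intercept).
    """
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical data: x2 is nearly a copy of x1, so both should show
# a VIF far above the usual cut-off of 10; x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # almost collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(np.round(vif(X), 1))
```

With this construction the first two VIF values far exceed 10, flagging the near-collinear pair, while the independent variable stays close to 1.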

However, VIF values should be evaluated in the context of the other factors that affect the specific variable; their combined effects can keep the variance of the regression coefficients low even when the VIF is 40 or more. When VIF values are high, one option is to respecify the model by removing one or more of the independent variables that suffer from multicollinearity. However, respecifying may sometimes do more harm than good to the model. (O'Brien 2007, 683-685)

If several of the variables explaining the variation of housing prices are multicollinear with each other, the econometric analysis is weakened. If multicollinearity problems are ignored in the empirical analysis, the parameter estimates are inconsistent, which results in unreliable test statistics. On the other hand, the risk of misspecification exists if the model is reduced and simplified too much. (Laakso 1997, 46-47) With respect to the specification of variables, the model may be over-specified if an irrelevant independent variable is included, or under-specified if a relevant independent variable is excluded. (Chin & Chau 2003) According to Puumalainen (2019), if the model is used only for predicting and the coefficients are not interpreted, multicollinearity does not cause large damage.

To measure how well the regression model fits the data, there are measures known as goodness of fit statistics. R2 is the most common of these. R2 is the square of the correlation between the observed values of the explained variable and the corresponding values predicted by the model. As the correlation coefficient lies on a scale from -1 to 1, R2 lies between 0 and 1. If the model fits the data well, the R2 value is close to 1, and if it is close to 0, the model does not fit the data. The problem with R2 is that it never decreases when more variables are added to the regression: the R2 value of the extended regression will be at least as high as that of the previous one. This makes it difficult to use R2 to determine whether an added variable should be included in the model. A modification of R2 that accounts for the loss of degrees of freedom from adding more variables passes these problems. Adjusted R2 can be used to decide whether an added variable should be included in the model: if the value of adjusted R2 increases, the variable should be included, and if it decreases, it should not. (Brooks 2014, 151-155)
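The decision rule above can be illustrated with a small sketch. The data below is hypothetical (a price explained by dwelling size, plus a pure noise variable): plain R2 can never fall when the noise variable is added, whereas adjusted R2, which penalizes the lost degree of freedom, may fall and thereby signal exclusion.

```python
import numpy as np

def r2_and_adj(y, X):
    """Fit OLS with an intercept; return (R2, adjusted R2)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sst = (y - y.mean()) @ (y - y.mean())
    r2 = 1 - (resid @ resid) / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Hypothetical housing-style data: price depends on size only.
rng = np.random.default_rng(1)
size = rng.uniform(30, 120, 100)
noise = rng.normal(size=100)  # irrelevant candidate variable
price = 3000 * size + rng.normal(scale=20000, size=100)

r2_a, adj_a = r2_and_adj(price, size[:, None])
r2_b, adj_b = r2_and_adj(price, np.column_stack([size, noise]))
# R2 never decreases when a variable is added; adjusted R2 may.
print(round(r2_a, 4), round(adj_a, 4))
print(round(r2_b, 4), round(adj_b, 4))
```

Comparing adj_b against adj_a then implements the inclusion rule from the text: keep the noise variable only if the adjusted R2 rose.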

Outliers are observations that lie away from most of the other observations on one variable or a combination of variables (Hutcheson & Sofroniou 1999, 19). These single divergent observations may affect the results of regression analysis (KvantiMOTV 2003).

Outliers are points in the data that do not fit the pattern and lie away from the fitted model (Brooks 2014, 690). There are several reasons why outliers exist: for example, the data might include misspellings or problems with missing values. To check whether there are outliers in the data, it is possible to use graphical methods such as histograms, box plots, or normal probability plots. (Hutcheson & Sofroniou 1999, 19-20) One way to improve the model is to remove the outliers. When outliers are removed, the standard errors and the residual sum of squares are reduced, and therefore R2 increases, which means a better fit of the model to the data. If there are outliers in the data, they may have a significant effect on the coefficient estimates in OLS, which minimizes the distances between the observations and the fitted line. When the residual of a point far from the fitted line is squared, it leads to a large increase in the RSS. Even though each observation carries valuable information, there is a benefit in removing outlier observations that could have an excessive effect on the OLS estimates. (Brooks 2014, 211-213)
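The box-plot check mentioned above has a simple numerical counterpart, the interquartile-range rule that box plots are drawn from. A minimal sketch, on a hypothetical price-per-square-metre sample containing one typo-like entry:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

# Hypothetical EUR/m2 prices; 33000 mimics a data-entry error.
prices = np.array([3200, 3400, 3100, 3600, 3300, 3500, 33000])
mask = iqr_outliers(prices)
print(prices[mask])  # → [33000]
```

Only the divergent observation is flagged; whether such points are then removed is the judgment call discussed above, since each observation also carries information.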


3 DATA AND METHODOLOGY

In chapter 3.1, the dataset and some descriptive statistics are presented. The chapter also includes an analysis of housing price development over the timeframe 2009-2019 to give a better insight into how prices have developed in different areas during the different stages of Westmetro's construction. Chapter 3.2 includes sub-chapters in which the data is tested to check whether the OLS estimator is BLUE, regression models for the different datasets are conducted, the performance of the conducted models is compared, and finally the model with the highest prediction accuracy is chosen for the actual predictions. As this study aims to predict housing price development for the time Westmetro's phase 2 starts operating, and the estimated start of operation is in 2023, the effect of the start of operation cannot be determined from the 2009-2019 data for the phase 2 areas. A separate regression model is therefore conducted for the phase 1 areas to obtain the effect of the metro starting to operate, and the equation for the phase 2 prediction is formed by combining the chosen regression model with this effect from the phase 1 regression model. Towards the end of the chapter, sub-chapters present the observations that are used to create the predictions for the years 2020-2023, as well as two options for alternative predictions.