
6. Data and ethical issues

A custom, high-resolution dataset that records economic, environmental, and infrastructural characteristics of the studied urban areas was developed by the researcher during the first year of the dissertation and first presented in article I. The dataset’s components are high-quality registered data from official sources. Its quality, resolution, and comprehensiveness make it high-level data by international standards. These qualities came at a cost: the dataset development process was extensive, and the geocoding, data overlay and combination, quality checking, format conversion, and storage operations consumed considerable resources. This step has been crucial for the research, and the effort and challenges involved in developing this kind of interdisciplinary spatio-temporal microdata are often overlooked by analysts who are not involved in data preparation tasks. This section introduces the thesis’ data, study areas, and overall data processing and analysis workflow. Ethical considerations in data handling and analytical research are also discussed.

6.1. Study areas, empirical datasets, and overall research workflow

The thesis relied on empirical data from the housing markets of Helsinki’s urban region, Pori, and Rovaniemi, and on geospatial data of land use, topography, the building stock, and infrastructure. These data were used in conjunction or fully merged with each other in order to produce a custom-made dataset. The Helsinki region is the capital region of Finland, with a population of approx. 1,116,000. Its constituent municipalities are Helsinki, Espoo, Vantaa, and Kauniainen. The extended metropolitan area, Greater Helsinki, has a population of approx. 1,500,000 and includes a few additional municipalities. Pori is a city on the west coast of Finland with approx. 85,000 residents (approx. 140,000 inhabitants in its broader urban region). Rovaniemi is a city in the north of Finland with approx. 60,000 inhabitants. Figure 7 displays the three urban regions.

Figure 7: Pori (left), the Finnish capital region (center), and Rovaniemi (right). The inset maps locate the NUTS-3 regions of the three urban areas inside Finland.

The key dataset of the research is a proprietary time series (1970-2011) of housing transactions, acquired and licensed from the Technical Research Institute of Finland Ltd (VTT). The data record the selling price, list/sale dates, address, and structural attributes of a sample of sold properties. The data are voluntarily gathered by participating real estate brokers and are assembled, quality-checked, and maintained by VTT. Beyond the real estate dataset, extensive use has been made of the Finnish National Land Survey’s land use dataset (SLICES, pictured in Figure 7 as a greyscale image) and topographic database (Maastotietokanta, whose building stock component is pictured in Figure 6). The former is a 10 by 10 meter raster representation of land use in Finland and the latter a vector representation of the man-made and natural landscape of Finland at a scale of 1:10,000.

Additionally, GIS versions of official flood risk maps for various Finnish cities were provided by the Finnish Environment Institute. Various auxiliary data complemented the analysis, notably variables from the national and regional economic accounts by Statistics Finland and EUROSTAT.

Figure 8 displays the general workflow of data, preprocessing, and analysis. Land use and real estate information were merged to produce hybrid socioeconomic-biogeophysical datasets. The housing transaction data were georeferenced by using the properties’ street addresses and stored as point features in a GIS dataset. The attribute table of the transaction points (i.e. the original, non-spatial hedonic attributes) was expanded to include proximities to various land uses, services, infrastructure, and topographical features. Similarly, the flood risk maps were used to categorize properties into flood-safe and flood-prone classes and to subsequently analyze this categorization in connection to housing prices and hedonic attributes. Lastly, the land use and topographical data were used in developing a multitemporal dataset of land uses and the transport network. Zoning and growth constraints were derived from land use data and planning agencies, whereas topographical parameters were derived from the National Land Survey’s 10-meter digital elevation model.

Figure 8: Main data processing and analysis workflow of the thesis.
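The attribute expansion and flood classification steps described above can be illustrated with a minimal sketch. It assumes point coordinates in a projected, meter-based system and uses only the Python standard library; the function names, field names (`dist_park_m`, `flood_prone`), and toy data are hypothetical and do not reproduce the GIS software or the variables actually used in the thesis.

```python
import math

def nearest_distance(point, features):
    """Euclidean distance (map units) from a point to the closest feature point."""
    px, py = point
    return min(math.hypot(px - fx, py - fy) for fx, fy in features)

def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects to the first.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def enrich(transactions, parks, flood_zone):
    """Add a proximity attribute and a flood class to each transaction record."""
    enriched = []
    for rec in transactions:
        xy = (rec["x"], rec["y"])
        new = dict(rec)  # keep the original (non-spatial) hedonic attributes
        new["dist_park_m"] = nearest_distance(xy, parks)
        new["flood_prone"] = point_in_polygon(xy, flood_zone)
        enriched.append(new)
    return enriched

# Hypothetical toy data, not taken from the thesis dataset.
transactions = [
    {"id": 1, "price": 200000, "x": 10.0, "y": 10.0},
    {"id": 2, "price": 150000, "x": 130.0, "y": 50.0},
]
parks = [(10.0, 40.0), (100.0, 10.0)]
flood_zone = [(100.0, 0.0), (200.0, 0.0), (200.0, 100.0), (100.0, 100.0)]
result = enrich(transactions, parks, flood_zone)
```

In this toy case the first transaction lies 30 m from the nearest park and outside the flood zone, while the second lies 50 m from a park and inside it. A production workflow would normally measure distances to polygon or line geometries rather than to representative points.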

The GIS data are stored in the ETRS EUREF-FIN projected coordinate system, which is the current official coordinate system of Finland and complies with the EU INSPIRE directive. A small portion of the data is stored in the formerly official YKJ (KKJ zone 3) projected coordinate system. The spatial analysis prioritized data that have EUREF-FIN as their native coordinate system. In the few cases where data could be acquired only in the YKJ system, re-projections were performed. The error in these transformations is insignificant for the type of economic processes studied by this research.

A particularity of spatial analysis with housing transaction data is the handling of apparent duplicate observations. These are points with exactly the same coordinates that nevertheless correspond to distinct market transactions. They arise either when the same dwelling is sold repeatedly during its lifecycle or when multiple properties (for instance, apartments) share the same address. The spatial econometric tools used cannot handle such coincident points appropriately, so a procedure was established that preserves the duplicates while displacing them (i.e. moving them apart) by a few centimeters. In practice, this distance is so small that it has no effect on the modelled mechanisms, which have been captured at scales from a few meters to tens or hundreds of meters.
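The text does not specify how the displacement was implemented. One possible approach, given here only as a sketch under stated assumptions, spreads coincident points evenly on a small circle around their shared location. The 2 cm radius, the circular arrangement, and the function name `displace_duplicates` are assumptions made for illustration; coordinates are taken to be in meters.

```python
import math
from collections import defaultdict

def displace_duplicates(points, radius=0.02):
    """Move coincident points apart by a small distance, keeping every observation.

    points: list of (x, y) tuples in meters.
    radius: displacement in meters (assumed value of 2 cm).
    Points with unique coordinates are returned unchanged; order is preserved.
    """
    # Group observation indices by their exact coordinates.
    groups = defaultdict(list)
    for idx, xy in enumerate(points):
        groups[xy].append(idx)

    moved = list(points)
    for (x, y), members in groups.items():
        if len(members) < 2:
            continue  # not a duplicate location
        for k, idx in enumerate(members):
            # Place the k-th duplicate at an evenly spaced angle on the circle.
            angle = 2 * math.pi * k / len(members)
            moved[idx] = (x + radius * math.cos(angle),
                          y + radius * math.sin(angle))
    return moved

# Hypothetical example: three transactions at one address, one elsewhere.
pts = [(100.0, 200.0), (100.0, 200.0), (100.0, 200.0), (500.0, 600.0)]
out = displace_duplicates(pts)
```

After the call, all four points have distinct coordinates, and each displaced point lies 2 cm from its original location, far below the scale of meters at which the modelled mechanisms operate.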

6.2. Data privacy and ethical issues

Even though the data used for the dissertation research do not contain sensitive information, there are some risks involved in handling them. Each real estate transaction record contains the address of the sold property. In theory, by combining the address and the date of transaction, it is possible to identify the household(s) involved in the transaction, revealing potentially sensitive information. Such information may include, for instance, the various financial and structural characteristics of the sold property. No names, contact information, or identification numbers are included in the dataset, nor does it include demographic or financial information about the seller and buyer. However, technically skilled analysts with several spatial datasets at their disposal could pinpoint the demographic characteristics of the neighborhood in which the transaction happened. Unless multiple security breaches take place in several organizations that handle interconnected datasets, it is not possible to identify particular individuals.

Due to the above considerations, and in coordination with the real estate data proprietor (VTT Ltd), the research took four precautions to protect the data. Firstly, the data are stored on a password-protected device that is not accessible from the intranet or the internet. Geocoding, which requires internet access, was performed after stripping all information from the records except the address and a custom-made unique identifier for each transaction. Once those records were geocoded into points, they were re-joined offline to their remaining attributes. Secondly, analysis involving the data is conducted with no access to the internet. In both cases, the computer processing the data is additionally protected by a firewall that is part of the Finnish government’s ICT infrastructure. Thirdly, the quantitative and qualitative results are communicated as aggregates, usually referring to no fewer than a few hundred observations. Although the minimum level of aggregation was agreed with VTT Ltd at eight observations, no results have been reported at such a low level of aggregation. The presentation of the research results is not specific enough to enable one to deduce sensitive information about particular housing properties. Similarly, maps, figures, and images do not display identifiable disaggregate points. Lastly, all published results are reviewed by the data proprietor and approved as meeting the formal security and ethical requirements.
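Two of these precautions lend themselves to a short illustration: exposing only the address and an identifier during geocoding, and withholding aggregates that fall below the agreed minimum of eight observations. The sketch below is not the procedure actually used in the research; the function names, the random identifiers from `uuid`, the stand-in geocoder, and the example address are all hypothetical.

```python
import uuid

def split_for_geocoding(records):
    """Separate records into an online-safe part and a part that stays offline.

    The online part holds only a random identifier and the address.
    The offline part maps each identifier to the full original record.
    """
    online, offline = [], {}
    for rec in records:
        key = uuid.uuid4().hex  # random identifier, carries no record content
        online.append({"key": key, "address": rec["address"]})
        offline[key] = dict(rec)
    return online, offline

def rejoin(geocoded, offline):
    """Re-attach the withheld attributes to the geocoded points by identifier."""
    return [{**offline[g["key"]], "x": g["x"], "y": g["y"]} for g in geocoded]

def suppress_small_groups(counts, minimum=8):
    """Drop aggregates that rest on fewer observations than the agreed minimum."""
    return {group: n for group, n in counts.items() if n >= minimum}

# Hypothetical records; the address is a made-up placeholder.
records = [
    {"address": "Esimerkkikatu 1", "price": 210000, "sale_year": 2005},
    {"address": "Esimerkkikatu 1", "price": 180000, "sale_year": 1998},
]
online, offline = split_for_geocoding(records)

# Stand-in for the online geocoding service, which sees only key and address.
geocoded = [{"key": row["key"], "x": 385000.0, "y": 6672000.0} for row in online]

rejoined = rejoin(geocoded, offline)
published = suppress_small_groups({"district A": 350, "district B": 5, "district C": 8})
```

Here `published` retains districts A and C and drops district B, whose five observations fall below the threshold. Random identifiers are used rather than sequential ones so that the key itself reveals nothing about the record order or content.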

The rest of the datasets involving descriptions of the physical, natural, and social environment do not involve information that is considered sensitive or harmful to individuals. All of these data are in the public domain, of secondary nature, and have been handled extensively by responsible agencies and organizations in Finland or other European countries before being downloaded and used in the present research. Aside from data privacy in relation to secondary datasets, the research did not involve collection of primary data and information or other research interaction with human subjects or non-human species. Full credit has been given to the sources of data, theories, methodologies and other materials via the academic articles in which they were used.

There are also less technical and more philosophical ethical considerations related to possible misuse of the conducted research by third parties. For instance, how can one ensure that a recommendation for honoring the economic benefits of agglomeration is not misinterpreted as an implicit suggestion that the economy should be prioritized over the environment? Conversely, how can one ensure that criticism of market mechanisms that deteriorate urban ecosystems is not misread as an activist statement? Such questions are too abstract to fall within the scope of technical research such as the present dissertation; they belong rather to the domain of the theoretical sciences. However, in accordance with best ethical practices, the dissertation research has been informed by these broader ethical issues, and they have served as guidance in conceptualizing, interpreting, and presenting the quantitative results. The general stance of the thesis towards such issues is this: the sensible use of the presented analytical results should be considered a main issue of social responsibility. As mentioned throughout this text, the results of this thesis should be considered as parts of a wider array of problems, phenomena, and objectives. The mere fact that real estate prices rise or decline as the result of changes in the physical or social environment must not be taken in isolation; by itself, it conveys no meaning, and no policy recommendation should be made based on this fact alone.

6.3. Unavailable data

The study would have benefited from a longer time series of land use and infrastructure maps of the study areas. Public high-resolution land use data extend from 2000 to 2012, while the property transaction records that were available for this research extend from 1970 to 2011. This meant that hedonic valuations could only estimate the shadow prices of ecological attributes starting from approximately 1995, whereas the records from 1970-1995 could not be fully utilized. The implication is that 25 years of temporal variation in the implicit prices of risks and amenities could not be retrieved. The availability of these data would have enabled the estimation of demand determinants for risks and amenities in articles I-III. In addition to an extended time series of land use maps, article III would have benefited from property transaction records after 2011. This would have enabled a longer tracking of market responses to risk-related shocks, providing information on, among other things, memory effects and risk information decline in the housing market.
