
4. RESEARCH APPROACH

4.1.1. Data

The long-run patterns of income inequality are commonly studied using income tax records, social tables, wages, and ratios such as that between land rents and wages (Milanovic, Lindert, and Williamson 2011; Prados de la Escosura 2008; Humphries and Weisdorf 2015; Atkinson and Piketty 2007, 2010; Lindert 2000). In this study, I utilised income tax records as the main source for assessing income inequality, complemented by supplementary data such as wages and household surveys. The income distribution surveys are from the modern period. The number of studies examining income inequality using income tax records has grown rapidly in recent decades. Their results and the ensuing discussions have changed our view of the past and enhanced our knowledge of how societies work. However, using tax records also has its caveats. Next, I present the main data used in the thesis, discuss what and who are represented in the data, and explain some heterogeneities in the data (see further discussion on methodology in Atkinson and Bourguignon 2015).

The main sources utilised in this PhD thesis on income inequality are national state income tax (1865, 1871, 1876 and 1880), municipal income tax (1898–99 and 1904), high income tax (1916) and state income tax (1920–2004).10 In addition, a few supplementary data sources were collected: municipal taxation from the four largest cities (Helsinki, Turku, Tampere and Viipuri) from 1875 to 1899 (Renvall 1900), municipal taxation in Helsinki for 1880, 1900–04 and 1906–28,11 and an income survey of six rural municipalities (Tuusula, Humppila, Vihanti, Kymi, Räisälä and Vihanti) from 1903.12 Furthermore, household budget surveys (1966–81), income distribution statistics sample data (1986–1994) and income distribution survey data (1995–2019) have been used to characterise income inequality patterns during the most recent decades, all of which are compiled by Statistics

10 Tax tabulations for the most part accounted for all taxed units; however, in 1945–1969 the tabulations consisted only of a representative sample of units. Fortunately, these samples included all top income earners and were relatively large (see Jäntti et al. 2010).

11 Statistical Yearbook of Helsinki, 1920–28.

12 See Gylling (1906); Gylling (1907). Central economic and demographic statistics were also collected from the following sources: wages (Heikkinen 1997, 2017b); demographic statistics (OSF 1870, 1882, 1894, 1905; Vattula 1983; Statistical Yearbook of Finland (SYF) 1879–1938, 1885, 1911, 1916, 1921, 1931, 1949, 1961–1962, 1987, 1990, 1991); GDP (Hjerppe 1989; OSF 2019).


Finland.13 Notably, not all statistics were compiled yearly, not even between the years 1966 and 2019.

Incomes, paid taxes and individuals are tabulated from 1865 onwards; however, the microdata were gathered for the highest income earners from 1916 (approximately the top 0.8% of income earners in the total population).14 In addition, the income inequality metrics and series have been estimated by the author for all years, even though Statistics Finland had already made such calculations for the years 1966–2019 based on its own microdata.

The same sources have previously been utilised in the following studies: Hjerppe and Lefgren (1974); Roine and Waldenström (2015); Jäntti (2006); Jäntti et al. (2010). Notably, tax records from the years 1898–99, 1904 and 1916 as well as data from recent years have not been previously used.

Finnish tax data are relatively uniform and of high quality when compared with those of other countries. As highlighted in articles I and II, Finnish tax records have a few advantages in comparison with those of other countries (Atkinson, Piketty, and Saez 2011). First, tax statistics in Finland have been public documents. In fact, journalists even today go through these records looking for irregularities, thereby enhancing the reliability of the statistics. Second, until 1918 the number of votes available to an individual in a municipal election was determined partly by the taxes paid. Therefore, individuals’ incentives to hide their incomes were more limited. Third, the tax rates were extremely low, at least until the early 20th century, which diminished the benefits of underreporting incomes. Fourth, income taxation in Finland already had relatively modern and efficient characteristics in the early 1920s, and to some extent in the 19th century. For example, the tax boards had extensive rights, and in part an obligation, to obtain information from employers, banks, foundations and public officials (including guardianship boards) to determine taxpayers’ incomes and taxes. Since at least 1920, when tax returns were made obligatory, tax returns were compared with such documents and a person’s taxes were significantly increased if any signs of deception were detected (OSF 1869–85, 1926–2004; Statute Books of Finland (SBF) 1865–2004).15

13 Hjelt & Broms (1904, 1905); NAF (1917–18); OSF (1869–85, 1926–2004, 2021); SYF (2010).

14 Average incomes in each income bracket are available only for the years 1880, 1898–99 and 1904 and from 1920 onwards. In addition, average incomes for the tabulated income brackets are unknown for urban areas in 1898 and 1904. Therefore, we used the same average incomes in the income brackets for 1865, 1871 and 1876 as in 1880 as well as the same average incomes for the years 1898 and 1899 in urban areas. Moreover, we utilised the robust Pareto midpoint estimator method (RPME method) to control for the differences and to construct a control series (von Hippel, Scarpino, and Holas 2016). In practice, this method uses only tax bracket thresholds as well as harmonic averages and the Pareto coefficients at the top to estimate the inequality metrics. The results were quite similar to our main estimations (see more details in article I, pp. 8–10).

15 Collected income taxes matched quite closely rising income levels (GDP) in 1865–1885 (see article I, Figure B1). In addition, Figure B2 in article I shows that taxation remained at a relatively similar level until at


The income concept refers to taxable income before 1945 and to income subject to taxation thereafter. Moreover, because few transfers were taxed and their extent was limited, the income concept was relatively close to that of factor income until the early 1980s. Thereafter, the income concept shifted towards gross income since certain social transfers came to be regarded as taxable income (e.g. national pensions in 1983) (OSF 1869–85, 1926–2004; SBF 1865–2004). I decided to utilise incomes before deductions when possible (from 1945 onwards), as the deduction system is rather complex and non-transparent. The most problematic issues relate to agricultural and forest incomes, deductions given during the Great Depression and deductions for having children.16 It is practically impossible to estimate the effect of these deductions on the income inequality metrics before 1945; however, we know that all social classes made use of the deductions. Furthermore, many deductions have their own benefits and purpose. For example, the deductions for children take into account the differences in living expenses between large and small households, which makes comparisons more meaningful. In fact, modern income data do more or less the same by utilising equivalence scales (OSF 2019). Fortunately, the taxable income and income subject to taxation measures are tightly interconnected and display similar trends after 1945 (see the series in Jäntti et al. 2010). Therefore, the complex deduction system most likely played only a minor role in the estimated income inequality metrics before 1945.

In principle, each household for the most part comprised a tax unit before 1934. However, the tax legislation allowed for some variability. Before 1934, a wife’s income could be taxed separately if she had the legal right to control her income or wealth, even though tax legislation in general stated that the incomes of children and wives should be taxed jointly with the incomes of husbands (OSF 1920–1921, p. 4). Remarkably, the proportion of women being taxed increased significantly during the 20th century: 12.3% in 1920, 35.6% in 1952 and 41.5% in 1968. However, not until the late 1980s were both spouses included in the tax statistics as individual observations. To sum up, the tax unit was the household in 1865–1900, whereas it gradually shifted towards the individual until 1989, especially after 1934.17 To make the income inequality series more comparable

least the mid-1920s. See more details about international tax data and its limitations in Atkinson, Piketty, and Saez (2011).

16 See NAF (1917–18); OSF (1880, p. 2, 1937). There is evidence that in some cases tax officials made somewhat subjective tax decisions. Obtaining enough information from rural areas, and especially assessing agricultural and forest incomes, was difficult, whereas more information existed regarding urban populations (OSF 1926–1940). See more details on the taxation system in Wikström (1985); Willgren (1910, 1932); article I; Jäntti et al. (2010).

17 Companies’ dividends were taxed as an individual tax unit in 1865, 1871 and 1876, whereas companies’ profits were taxed in 1880. Individuals who received those dividends did not pay any taxes (no double


and historically more realistic, I chose the following tax units: the household (1865–1900)18, married couples and single persons (over 17 years of age) (1920–1989), and individuals (1990–2003). The total number of households is derived from population statistics for the years 1880, 1890 and 1900 (OSF 1882, 1894, 1905). However, labourers were included in their master’s household in the official statistics in 1865; therefore, we extrapolated these numbers backwards using information about the numbers of married men and widows in the years 1865–1900. Furthermore, we extended our estimations by utilising population statistics for 1930 and censuses in larger cities from 1880, 1890, 1900, 1910, 1920, 1930 and 1940 (OSF 1882, 1894, 1905, 1915, 1923, 1934, 1944). Next, the total number of tax units after 1920 is estimated as follows: individuals (over 17 years of age) – married women + taxed wives.19
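The tax unit count described above can be written out as a small calculation. The following is only a sketch: the function name and the input figures are hypothetical round numbers chosen for illustration, not values from the population statistics.

```python
def total_tax_units(adults_over_17: int, married_women: int, taxed_wives: int) -> int:
    """Tax units = individuals over 17 - married women + separately taxed wives."""
    if min(adults_over_17, married_women, taxed_wives) < 0:
        raise ValueError("counts must be non-negative")
    if taxed_wives > married_women:
        raise ValueError("taxed wives cannot exceed married women")
    if married_women > adults_over_17:
        raise ValueError("married women cannot exceed all adults")
    return adults_over_17 - married_women + taxed_wives


# Hypothetical round numbers, used only to show the arithmetic.
units = total_tax_units(adults_over_17=2_000_000, married_women=600_000, taxed_wives=50_000)
print(units)
```

Subtracting married women treats each married couple as one unit, and adding back the separately taxed wives restores those who appear in the tax statistics as units of their own.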

Due to changes made to the lowest tax boundary as well as other changes made to the tax statistics, the share of the taxed population fluctuated from below 20% in 1865 to roughly 90% in the mid-1960s (see article II, Figure 1). Therefore, establishing proxies for the total number of tax units and for the incomes of the untaxed population is crucial for estimating income inequality measures.

Prior studies have adopted both top-down and bottom-up approaches to establish a proxy for the untaxed population. In the top-down approach, the total household taxable income is set as a fixed percentage of the total private household income or GDP derived from national accounts. In contrast, the bottom-up approach utilises other sources to construct a meaningful proxy for average incomes of the non-taxed population (Bartels 2019). Since it is not possible to capture comparable total household incomes based on national accounts, I decided to utilise the bottom-up approach. In principle, I formulated five alternative assumptions for the average income of the non-taxed population: 40%, 50%, 60%, 72% and 80% of the lowest income bracket.20
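To illustrate how the bottom-up assumption feeds into a top income share, the sketch below imputes the income of the non-taxed population as a fraction of the lowest tax threshold and recomputes the share under each of the five fractions. All function names and input figures are hypothetical, and the sketch assumes that the top group lies entirely within the taxed population.

```python
# The five alternative assumptions named in the text.
FRACTIONS = (0.40, 0.50, 0.60, 0.72, 0.80)


def total_income_with_imputation(taxed_income, taxed_units, total_units,
                                 lowest_threshold, fraction):
    """Taxed income plus imputed income of the non-taxed units."""
    untaxed_units = total_units - taxed_units
    if untaxed_units < 0:
        raise ValueError("taxed units cannot exceed total units")
    return taxed_income + untaxed_units * fraction * lowest_threshold


def top_income_share_with_imputation(top_income, taxed_income, taxed_units,
                                     total_units, lowest_threshold, fraction):
    """Share of total (taxed + imputed) income received by the top group."""
    total = total_income_with_imputation(
        taxed_income, taxed_units, total_units, lowest_threshold, fraction)
    return top_income / total


# Hypothetical figures: 200 taxed units out of 1,000, lowest threshold 500.
shares = {
    f: top_income_share_with_imputation(
        top_income=100_000, taxed_income=500_000, taxed_units=200,
        total_units=1_000, lowest_threshold=500, fraction=f)
    for f in FRACTIONS
}
for f, s in shares.items():
    print(f"{f:.2f} -> {s:.4f}")
```

A higher assumed fraction raises the income total and therefore lowers the estimated top share, which is why the choice of fraction matters most in years when a large part of the population was untaxed.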

taxation). It is impossible to separate companies from households between the years 1865 and 1880; however, estimates from the 1920s suggest that this had only a minor effect on income inequality estimates (see article I, pp. 9–10).

18 In article I, however, we utilised the household as the tax unit between the years 1865 and 1934. In practice, the differences between the estimates presented in articles I and II were insignificant since it was quite improbable that wives were individually taxed. See also the comparison of these estimates in article II, Figure 3.

19 The following sources were utilised to construct the total number of tax units: Vattula (1982, p. 32); Statistical Yearbook of Finland (SYF) (1987, p. 70, 1990, p. 68, 1991, p. 68); Statistics Finland (2018). Between 1948 and 1968, undistributed estates are included in the tax units. The number of tax units was interpolated for some years due to a lack of data.

20 In article I, we assumed that the average income of the non-taxed population was 72% of the lowest tax threshold in 1865–1900 and 50% in later years. The choice to set the average income of the non-taxed households at 72% of the lowest tax boundary was based on supplementary materials from income surveys, wages as well as tax officials’ approximations (see article I, p. 4). In article II, on the other hand, we utilised 40%, 60% and 80% as the corresponding percentages. Finally, we can conclude that the trends

4.1.2. Methods

To estimate income inequality measures, we reconstructed distributions from the tabulated income brackets. The methodology for reconstructing distributions based on tabulations was already established by Kuznets (1955) and has been utilised by, for example, Alvaredo et al. (2013). Researchers have long used the Pareto interpolation method, which relies on inverted Pareto coefficients [b(p)], where b(p) is the ratio between the average income above rank p and the p-th quantile [Q(p)]. Blanchet, Fournier, and Piketty (2017) have proposed a more elaborate approach, the generalised Pareto interpolation method, which allows the Pareto coefficient to vary in different parts of the distribution (a non-parametric method). The inverted Pareto coefficient is defined in Equation (1):

$$ b(p) = \frac{\mathbb{E}\left[X \mid X > Q(p)\right]}{Q(p)}, \quad \text{with } 0 < p < 1 \tag{1} $$
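As a minimal sketch of the definition of b(p), the function below computes the inverted Pareto coefficient directly from a list of individual incomes. The function names, the linear-interpolation quantile rule and the toy incomes are assumptions made for illustration and are not taken from the thesis or from the gpinter package.

```python
def empirical_quantile(sorted_incomes, p):
    """Quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_incomes) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_incomes) - 1)
    return sorted_incomes[lo] + (pos - lo) * (sorted_incomes[hi] - sorted_incomes[lo])


def inverted_pareto_coefficient(incomes, p):
    """b(p): mean income above the p-th quantile divided by that quantile."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    x = sorted(incomes)
    q = empirical_quantile(x, p)
    above = [v for v in x if v > q]
    if q <= 0 or not above:
        raise ValueError("b(p) is undefined for this sample and p")
    return sum(above) / len(above) / q


# Hypothetical toy incomes, not data from the thesis.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b_half = inverted_pareto_coefficient(toy, 0.5)  # Q(0.5) = 5.5, mean above = 8
print(round(b_half, 4))
```

For an exact Pareto distribution with shape parameter α > 1, b(p) equals α / (α − 1) at every rank p, so a b(p) that varies with p signals a departure from a single Pareto law.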

There is some evidence that the generalised Pareto interpolation method is more precise than other commonly utilised methods and that it fits the entire distribution relatively accurately.21 The precision of these methods can be tested by creating tabulations from microdata and examining the mean percentage gap between the estimated and observed values. Blanchet, Fournier, and Piketty (2017) highlight that the interpolation method must make use of all relevant information included in the tax tabulations. However, this is not possible in many cases. The constant Pareto coefficient method discards data on quantiles and averages at the upper end of the bracket. In addition, the log-linear interpolation method utilises only threshold information. These two methods performed the worst in the tests: their mean relative error was 12–125 times that of the generalised Pareto interpolation method when utilising microdata from the US for the years 1962–2014. The mean-split histogram method performed better since it uses more information than just the tabulations, although it cannot be utilised beyond the last threshold. However, it failed to estimate the top of the distribution: the mean relative error was 14 times that of the generalised Pareto interpolation method when using the US data (1962–2014). In sum, although the mean percentage gaps were not as great when using these methods with data from France (1994–2012), the generalised Pareto interpolation method clearly outperforms all other commonly utilised methods (Blanchet, Fournier, and Piketty 2017, pp.

were similar in all scenarios; however, top income shares proved to be more useful and precise in the long run than overall distribution measures, such as the Gini coefficient (see robustness tests for various percentages in article I, Figure B3 and article II, Figure 2).

21 It is possible to use the generalised Pareto interpolation method directly by going to the World Inequality Database (WID) webpage or downloading the R package from the site’s webpage (also available using gpinter command in R).


18–23). In fact, Blanchet, Fournier, and Piketty (2017, p. 3) argue that, even with a small number of income brackets (four, at p = 0.1, 0.5, 0.9 and 0.99), this method is much more precise than using large samples (100,000 observations).
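The testing procedure described above (tabulate microdata, interpolate, compare with the observed value) can be sketched as follows. Note that the sketch uses the simple log-linear interpolation, which relies on thresholds only, and not the generalised Pareto interpolation method. The microdata are a synthetic, deterministic Pareto grid, and all names and parameter values are assumptions made for illustration.

```python
import math


def survival_fractions(incomes, thresholds):
    """Share of units with income at or above each bracket threshold."""
    n = len(incomes)
    return [sum(1 for x in incomes if x >= t) / n for t in thresholds]


def loglinear_quantile(thresholds, survival, p):
    """Estimate Q(p) assuming log(1 - F) is linear in log(income) within a bracket."""
    target = 1.0 - p
    for k in range(len(thresholds) - 1):
        s_lo, s_hi = survival[k], survival[k + 1]
        if s_hi <= target <= s_lo and 0 < s_hi < s_lo:
            # Local Pareto exponent implied by the two bracket thresholds.
            alpha = math.log(s_lo / s_hi) / math.log(thresholds[k + 1] / thresholds[k])
            return thresholds[k] * (s_lo / target) ** (1.0 / alpha)
    raise ValueError("p is outside the tabulated range")


def relative_error(estimate, observed):
    return abs(estimate - observed) / observed


# Synthetic microdata: Pareto quantile function evaluated on an even grid.
n = 10_000
alpha_true = 2.0
microdata = [(1.0 - (i + 0.5) / n) ** (-1.0 / alpha_true) for i in range(n)]

# Step 1: build a tabulation (thresholds and population shares above them).
thresholds = [1.0, 2.0, 4.0, 8.0]
survival = survival_fractions(microdata, thresholds)

# Step 2: interpolate from the tabulation and compare with the microdata.
p = 0.9
estimated = loglinear_quantile(thresholds, survival, p)
observed = sorted(microdata)[int(p * n)]
gap = relative_error(estimated, observed)
print(f"estimated={estimated:.4f} observed={observed:.4f} gap={gap:.5f}")
```

Because the synthetic data follow a single Pareto law, the log-linear rule performs well here; real income distributions deviate from that law, which is where methods that use more of the tabulated information are expected to do better.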

After reconstructing the individual observations, calculating the corresponding inequality measures, such as the Gini coefficient and the top income shares (see e.g. Fellman 2018), is quite straightforward.
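As an illustration of this last step, the sketch below computes a Gini coefficient and a top income share from a list of individual observations. The function names and the toy incomes are hypothetical, observations are assumed to be equally weighted, and the Gini formula is the standard rank-based expression for a finite sample.

```python
def gini(incomes):
    """Gini coefficient of equally weighted, non-negative incomes."""
    x = sorted(incomes)
    n = len(x)
    total = sum(x)
    if n == 0 or total <= 0:
        raise ValueError("need at least one observation and positive total income")
    # Rank-weighted sum with ranks 1..n in ascending order of income.
    weighted = sum((i + 1) * v for i, v in enumerate(x))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def top_share(incomes, top_fraction):
    """Share of total income received by the richest top_fraction of units."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    x = sorted(incomes, reverse=True)
    k = max(1, round(top_fraction * len(x)))
    return sum(x[:k]) / sum(x)


# Hypothetical toy observations.
print(gini([1, 1, 1, 1]))             # perfectly equal incomes
print(gini([0, 0, 0, 10]))            # one unit holds everything
print(top_share([1, 1, 1, 7], 0.25))  # richest quarter of four units
```

With n observations, the largest value this Gini formula can take is (n − 1) / n, which is why the second toy example does not reach 1.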

4.2. Family linkages

4.2.1. Data

The literature considering social mobility patterns in the long run is limited due to a scarcity of data (van Leeuwen and Maas 2010; Long and Ferrie 2018). Commonly utilised datasets include population censuses and other population registers, such as communion books and local population registers.

Moreover, the socio-economic statuses of individuals or families are captured by using varying types of material, including lists on educational enrolments and graduation, emigration lists, electoral registers, the notes of criminal defendants as well as wealth and income taxes (Clark et al. 2015; Clark and Cummins 2015; Song et al. 2020; Modalsli 2017; van Leeuwen and Maas 2010).

To study long-run social mobility, we have reconstructed biological and marriage-related family trees from genealogical records (the 10Gen database), which today consists of 45,026 links between parents and sons from the early 18th century to the latter part of the 20th century. The data mainly account for southern and western Finland, having largely been compiled from the digitalised church records of local parishes, but other population registers were also utilised (e.g. communion books). Fortunately, these population registers contain basic information about each individual, such as their parents, descendants, spouse, siblings, occupations, wedding date, date of divorce, and the date and place of their birth and death.22 The research procedure relies on genealogical software (Family Historian) that enabled us to reconstruct the complex family lines and relations as well as create relevant variables and output files. It should be mentioned here that the heterogamy article (III) is based on an earlier version of the 10Gen database (2016), and therefore it only consists of roughly 8,000 first marriages.

22 The main part of the data has been reconstructed utilising two main sources: first, a digitalised church records database, called the HISKI database (The Genealogical Society of Finland (GSF) 2017), and second, the online archive materials of Finnish family history associations, such as communion books (Finland’s Family History Association 2017). Moreover, various parishes have provided information on family lines and other valuable information concerning families. Lastly, the MyHeritage database has been used on several occasions to reconstruct the family lines.


Antti Häkkinen and his research team began reconstructing family trees over a decade ago, beginning with the families of ten poor relief recipients. These families had varying social statuses, and they moved to Helsinki from various parts of the country during the Great Depression of the 1930s. The team traced family linkages back to the beginning of the 18th century. In practice, the descendants of the first generation (807 individuals), who lived at the beginning of the 18th century, form the complete database (45,026 links between parents and sons). In addition, spouses were added to the database when married.

Remarkably, the 10Gen database enabled us to observe occupational changes during individuals’ lifetimes since occupation was recorded when they left home, married, had children, remarried and migrated, and then once more when they died. These unique characteristics of the data enabled us to form more established occupational statuses for individuals, which is why our study differs from census-based studies that capture occupation only in the year in which the census was conducted. The 10Gen database has already proved its value in a study that observes the lifecycles of persons in pre-industrial society (Häkkinen 2018). In our study, we chose to use the sons’ highest occupational status during their life course, whereas parents’ occupation was taken from the event closest to their 40th birthday. Another advantage of our study is that the data match rates (completed linkages between generations compared with possible linkages) are many times higher than in census-based studies from the US and the UK (for the match rates and further details, see article IV, Table 1 in the appendix). One of the biggest problems in census-based studies is that individuals with the same names cannot be distinguished from each other, resulting in low match rates and possible biases from including only individuals with rarer surnames, or even false linkages (see further discussion of methodology in e.g. Long and Ferrie 2013a, 2013b; Xie and Killewald 2013). However, the match rates are not fully comparable between countries since the Finnish data reveal only the positive links. Fortunately, individuals are only rarely missing from the Finnish population statistics, and their movements across Finland were relatively easy to follow and include in the database.
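The two selection rules and the match rate described above can be sketched in a few lines. Everything here is hypothetical: the event format (year, status rank), the numeric status scale in which a higher number means a higher status, and the example figures are assumptions for illustration and do not reflect the actual coding of the 10Gen database.

```python
def sons_status(events):
    """Highest occupational status rank observed over the life course."""
    return max(rank for _, rank in events)


def parent_status(events, birth_year, target_age=40):
    """Status rank at the recorded event closest to the target age.

    If two events are equally close, the earlier entry in the list is used.
    """
    _, rank = min(events, key=lambda e: abs((e[0] - birth_year) - target_age))
    return rank


def match_rate(completed_links, possible_links):
    """Completed linkages between generations as a share of possible linkages."""
    if possible_links <= 0:
        raise ValueError("possible_links must be positive")
    return completed_links / possible_links


# Hypothetical life-course events: (year of event, status rank).
father_events = [(1780, 2), (1795, 3), (1810, 4)]  # ages 25, 40 and 55
son_events = [(1800, 1), (1820, 3), (1840, 2)]

print(parent_status(father_events, birth_year=1755))  # event at age 40
print(sons_status(son_events))
print(match_rate(45, 60))
```

Using the highest status for sons and a mid-life status for parents means the two generations are measured differently, so the rules should be read as a description of the chosen convention rather than as symmetric measures.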

No ideal data exist for studying social mobility in history, and the 10Gen database also has its weaknesses. We compared the 10Gen database with official statistics by examining occupational structure, average life expectancy at birth, the share of births, the share of those married and infant mortality rates (see Figure 1 and Figure 2 in article IV).23 Although the 10Gen database is rather

23 The information regarding population is derived from the following sources: OSF (2020a, 2020b, 2020c, 2020d); Statistical Yearbook of Finland (1987, 1990, 1991); Vattula (1983). In addition, the social structure is constructed from the following sources: 1751 (Fougstedt 1953); 1815–1875 (Kilpi 1913, 1915); 1910–60 (OSF 1915, 1923, 1934, 1944, 1955, 1956); 1950–2012 (OSF 1963, 2005, 2013).


uniquely reconstructed, it converges with the official statistics during the 18th century. After considering the conceptual differences between the 10Gen database and the official statistics, it became evident that the records in the database closely resemble the official statistics. The main conceptual differences are that our data cover mainly the southern and western parts of Finland (the wealthier parts) and that the occupational title assigned to an individual represents the highest occupational attainment in their lifetime, whereas the official statistics account for the whole country in a particular year. In addition, we assigned farmers’ sons to the farmer category, although they belonged to the labour classes in the official statistics until the late 19th century (1875) (see Figure 2 in article IV).24 Notably, the 10Gen database is still under construction, and for the 20th century we often have an individual’s basic information but no clear occupational title. Lastly, it is probable that in our data some older individuals have missing death dates, although the date of death was captured quite well for
