
2 Conceptualising the big data paradigm

2.2 Datafication and the data subject

The first parts of this chapter dealt with the evolution of the present surveillance society and connected the technological developments with a greater societal trend to track, analyse, and predict people’s behaviour. I will now turn to how datafication produces data subjects and how it relates to the big data paradigm.

The underlying logic of surveillance has not changed radically since the 1990s, but the scale of data collection and the availability of data have grown dramatically, which led to the widespread use of the term big data from the early 2000s onwards. The definition of big data varies, but most definitions include the following characteristics: there are large quantities of data, it is collected in real time and changes quickly, and the data is high in complexity owing to an extensive range of data types and sources (see e.g. Kitchin, 2014; Laney, 2001; Franks, 2015, p. 4, 24).8 The meaning of big data has somewhat expanded from referring to what the datasets contain to including also the inferences and analysis of said data. This definition can be compared with Rule’s (1974, pp. 37-40) four factors explaining the growth of ‘surveillance capacity’: (1) the size of files, (2) the degree of centralisation, (3) the speed of information flow, and (4) the number of contacts between administrative systems and subject populations.

A key aspect of the big data paradigm is that everything is logged on the off chance that the data might at some point be useful (Schneier, 2015, p. 19). Big data thrives in an environment of data maximisation because present-day data collection cannot predict future correlations. As Pasquale (2015, p. 32) explains, causal relationships do not have to be established because ‘correlation is enough to drive action’. According to Athique (2018, p. 65), big data is nothing but numerology,9 as the lack of interest in causality constitutes an explicit departure from the epistemology of science. Kitchin (2014) makes the important point that while a paradigmatic shift is underway, it is worth distinguishing between big data analysis in the form of completely inductive empiricism and data-driven science, which is more firmly rooted in the scientific tradition of deductive reasoning. While the latter is certainly more epistemologically sound, it is at the same time influenced by the data maximisation paradigm. Although data maximisation is unproblematic for non-personal data, it becomes controversial when applied to personal data. As Athique (2018, p. 62) points out, ‘For the purposes of the computational process alone, it is fundamentally irrelevant whether this information being unitised is about people or brightly coloured rocks’. For the people whose lives are impacted by decisions influenced by statistical inferences, the lack of established causality is more problematic. The idea that anything can be solved with sufficient data is further facilitated by the lowered costs of retaining data. The costs of computing power and storage have decreased immensely over the past decade, essentially removing the economic incentive to delete data that previously limited data collection and surveillant practices (Schneier, 2015, p. 24). Therefore, such incentives must be created with the help of regulation.

8 The industry usually refers to the three to five V’s of big data: volume, velocity, variety, variability and value. An oft-quoted simplified definition is ‘data too big for an Excel spreadsheet’.

9 The Oxford English Dictionary defines numerology as ‘the branch of knowledge that deals with the occult significance of numbers’.

Corporate actors log billions of transactions worldwide to create consumer profiles for various purposes. Insurance companies create risk profiles to determine appropriate premiums (Bouk, 2018), and credit rating agencies rate individuals’ creditworthiness based on past transactions (Lauer, 2017). Online advertising networks generate detailed profiles based on online behaviour, and data brokers aggregate data from all of these sources (Schneier, 2015; Christl, 2017). Security agencies like the U.S. NSA or the British Government Communications Headquarters (GCHQ) gather information to create security profiles (Greenwald, 2014). Although these profiles have been critiqued with terms such as dividual (Deleuze, 1992) or data double (Haggerty & Ericson, 2000, p. 606), the critique often focuses on different tools of surveillance but ignores how these profiles are constructed in practice.

What is important to note is that objective data points, such as age, sex, income, births, and deaths, have always been part of the modern bureaucratic state (see above and, for example, Giddens, 1985). The difference is that behavioural data points, such as what people like to read, what people are searching for online, whose social media profiles they look for, and other interests, are significantly easier to map than before (cf. Bolin, 2012; Bolin & Andersson Schwarz, 2015; van Dijck, 2014, p. 201).

While these events produce objective data points, they are used to infer subjective elements such as interests, psychographics, affective states, and behaviour. More than that, the subject is ‘reproduced’ in advance (Bogard, 2012, p. 35). The consequence is that a clear distinction between objective data points and subjective elements is no longer possible.

From a profiler’s perspective, the challenge does not lie in tracking people – these technologies are already quite advanced. The challenge lies in applying the appropriate weights to a wide variety of indicators to make the right assessments of people’s traits. Some profilers tend to disregard demographic data altogether and trust the inferences instead – why keep track of a person’s sexual orientation or marital status if it can be inferred from their behaviour, networks of friends or browsing habits? In 2002, Canadian Tire executive J.P. Martin came up with the idea to not only use past credit payment behaviour to predict whether a person was likely to pay their debt but also incorporate purchasing data in the predictive model. His analysis showed that people who bought felt pads for their furniture, carbon monoxide detectors for their home or branded motor oil were less likely to default on their loans (Duhigg, 2009).
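The general shape of such a predictive model can be sketched as a logistic score over behavioural indicators. The weights and feature names below are purely illustrative assumptions for the sake of exposition; the actual model and its coefficients are not public.

```python
import math

# Hypothetical weights, for illustration only. Positive weights raise
# the predicted default risk, negative weights lower it.
WEIGHTS = {
    "missed_payment": 1.8,
    "bought_felt_pads": -0.9,
    "bought_co_detector": -0.7,
    "bought_branded_motor_oil": -0.6,
}
BIAS = -2.0

def default_probability(profile):
    """Logistic score: P(default) from 0/1 behavioural indicators."""
    score = BIAS + sum(WEIGHTS[k] * v for k, v in profile.items())
    return 1 / (1 + math.exp(-score))

careful_buyer = {"missed_payment": 0, "bought_felt_pads": 1,
                 "bought_co_detector": 1, "bought_branded_motor_oil": 0}
late_payer = {"missed_payment": 1, "bought_felt_pads": 0,
              "bought_co_detector": 0, "bought_branded_motor_oil": 0}

print(default_probability(careful_buyer))  # low predicted risk
print(default_probability(late_payer))     # markedly higher predicted risk
```

The point of the sketch is that purchase indicators enter the score on exactly the same footing as payment history: the model is indifferent to what the variables mean.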

A psychological study by Kosinski, Stillwell and Graepel (2013) demonstrated that Facebook likes could be used to accurately predict a range of personal attributes such as ‘sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender’. Sensitive data is often not needed to make sensitive inferences (for a wide range of practical examples, see O’Neil, 2016). Facebook does not, for example, offer targeting based on race, religion, disability or sexual orientation (despite asking for several of these details when users sign up) but does offer ‘multicultural affinity segments’ for people whose activities on Facebook ‘suggest they may be interested in content related to the African American, Asian American, or Hispanic American communities’ (Facebook, 2018a, p. 2).

Another trend in profiling is the use of sentiment analysis to infer affective states. One of the more prominent examples is Spotify’s ‘mood data’ that it infers from its users’ listening behaviours and preferences. Since 2016, Spotify has shared this data with WPP’s Data Alliance, which means that a broad range of advertisers have access to the data (WPP, 2016). Spotify users may themselves suggest moods, indicating that Spotify’s mood data is partly informed by users’ own, active choices.

From a commercial perspective, the main goal is to define and find the most attractive customers, usually in the top 20% (Turow, 2006, p. 95). Advertising networks argue that through extensive profiling, people will be shown the ads most relevant to their needs. The truth is a bit more sinister because the most valued customers are the ones that buy the most. Discrimination is not an unwanted by-product; it is the product.

Gandy (2009) has demonstrated how the uses of data are often discriminatory by nature. He underlines how geographical data can be used as proxies for racial data and thus be used to discriminate against populations without explicit collection of ethnographic data (Gandy, 2009, p. 80). By combining census data and clustering models, U.S. ZIP codes could be turned into lifestyle clusters that could then be exploited in marketing systems. These geographic information systems (GIS) can of course be used for other purposes as well, such as for demonstrating how environmental hazards tend to be concentrated in areas inhabited by minority groups or differences in access to healthy foods. One of the key points Gandy (2009, p. 81) makes is that automated decisions often have a racial effect without necessarily having a racist intent. Even though automatic systems are said to eliminate human prejudice, the ways in which data are collected and interpreted may produce equally strong biases. While the provider of an advertising platform may not have a racist intent, advertisers can use these platforms for racist purposes, such as when Facebook allowed discriminatory housing ads (Facebook, 2018a, p. 3). Even though this discrimination was not made based on race but on ‘multicultural affinity segments’, the result is the same.
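The proxy mechanism Gandy describes can be illustrated with a minimal nearest-centroid sketch. The ZIP codes, census-style features, and ‘lifestyle cluster’ centroids below are entirely fabricated for illustration: no race variable appears anywhere, yet in a residentially segregated city the resulting clusters would still track racial lines.

```python
# Fabricated data: ZIP code -> (median income in $1,000s, share of renters).
zip_features = {
    "10001": (95, 0.70),
    "10002": (42, 0.85),
    "10003": (88, 0.65),
    "10004": (38, 0.90),
}

# Two hypothetical cluster centroids a marketing system might use.
centroids = {"premium": (90, 0.65), "budget": (40, 0.88)}

def assign_cluster(features):
    """Nearest-centroid assignment (a stand-in for k-means clustering)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(features, centroids[c])))

clusters = {z: assign_cluster(f) for z, f in zip_features.items()}
print(clusters)
```

A marketer who then targets only the ‘premium’ cluster never collects ethnographic data, but geography has done the sorting regardless.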

Popular writing on surveillance often presupposes that the information gathered is correct; in fact, it contains a significant number of errors owing to the statistical error rates associated with making inferences. Although the collection of data is extensive, it does not always translate into accurate predictions (Pasquale, 2015, p. 22). The statistical origins of how profiles are constructed mean that the accuracy of a profile is subject to a margin of error. Probabilities are never 1, and with datasets with millions of entries, even minor erroneous predictions may have devastating effects. For Haggerty and Ericson (2000, p. 632), profiles ‘transcend a purely representational idiom’, and as such the accuracy of the profiles is secondary to their pragmatic value of ‘allowing institutions to make discriminations among populations’. In other words, the profiles do not need to be a mirror of reality to serve their purpose as long as the error rate is within a tolerable range. In some cases, however, wrong predictions have devastating effects. Risks for false positives are also higher in some sectors than others. For example, finding evidence of money laundering and terrorism financing by looking at transactions is difficult because they differ little from legitimate transactions (Canhoto, 2013, p. 98).
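The scale effect of small error rates can be made concrete with a back-of-the-envelope calculation. The screening figures below are assumed for illustration and are not taken from any cited source:

```python
# Hypothetical screening scenario: a rare behaviour, a seemingly
# accurate classifier, and millions of cases.
population = 10_000_000      # transactions screened
prevalence = 0.0001          # 0.01% are genuinely suspicious
sensitivity = 0.99           # share of true cases correctly flagged
false_positive_rate = 0.001  # share of legitimate cases wrongly flagged

true_cases = population * prevalence                               # 1,000
true_positives = true_cases * sensitivity                          # 990
false_positives = (population - true_cases) * false_positive_rate  # 9,999

precision = true_positives / (true_positives + false_positives)
print(f"Flagged in total: {true_positives + false_positives:,.0f}")
print(f"Share of flags that are wrong: {1 - precision:.0%}")
```

Even with error rates of a fraction of a percent, roughly nine out of ten flagged cases in this sketch are false positives, which is the base-rate problem behind the money-laundering example above.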


It is important to address one conceptual distinction between the (post)panoptic diagram and the big data paradigm. The two overlap to a high degree, and it is often neither feasible nor necessary to distinguish between them when the data concerned is personal information. However, there is one difference that I would like to underline at this point. While the origins of surveillance are also rooted in regimes of disciplinary, bureaucratic control, the big data paradigm is less concerned with the psychological effects of being monitored. It is epistemically closer to the natural sciences than to social psychology. Surveillance is naturally an important element of this type of societal optimisation, but the big data paradigm moves beyond managing populations and focuses on managing resources, human or otherwise. It is therefore also necessary to address how personal information may be regarded as a resource and, by extension, a commodity.