5.1.1 Existing data and information quality definitions

Within the domain of UGC, there are no proper data and information quality definitions.

Most researchers base their definitions on existing research without adequately considering the context of the applied quality characteristics (Immonen, Pääkkönen and Ovaska, 2015; Spielhofer et al., 2017; Arolfo and Vaisman, 2018; Arthur et al., 2018).

UGC differs vastly from traditional content, and the existing definitions are not adequate for it. Over fifty quality characteristics are available to choose from, but most of them overlap, and only a portion are chosen and used at a time. Each chosen characteristic should be modified to match the specific domain (Redman, 1996; Wang and Strong, 1996; Batini and Scannapieco, 2006; ISO, 2008).

Table 20 presents four lists of data quality characteristics from well-known literature. Each list has been used to define data quality in its respective field. Some characteristics repeat, and some lists use a different term for the same thing. The leftmost column gives the term chosen for each characteristic in this dissertation.

Table 20. Data quality characteristics from literature

Chosen term | Corresponding terms in the source lists
Accuracy | Accuracy; Accuracy; Accuracy; Accuracy
Completeness | Completeness; Completeness; Completeness; Completeness
Consistency | Consistency; Consistency; Consistency
Credibility | Trust; Credibility; Believability / Reputation
Currentness | Currentness / Timeliness; Currentness; Currency; Timeliness
Accessibility | Accessibility; Accessibility; Accessibility
Usability | Usefulness
Relevancy | Appropriateness; Relevancy
Understandability | Readability; Understandability; Interpretability; Ease of understanding / Interpretability
Redundancy | Redundancy
Efficiency | Efficiency; Efficient use of memory
Privacy | Confidentiality; Access security
Portability | Portability; Portability
Precision | Precision; Format precision
Compliance | Compliance
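The term mapping in Table 20 can be read as a simple lookup from a source term to the term chosen in this dissertation. The following Python sketch illustrates that reading; the synonyms are copied from the rows of Table 20, while the names `CHOSEN_TERM` and `to_chosen_term` are hypothetical and introduced only for illustration.

```python
# Synonyms copied from the rows of Table 20. The names CHOSEN_TERM and
# to_chosen_term are illustrative assumptions, not part of the source.
CHOSEN_TERM = {
    "trust": "Credibility",
    "believability": "Credibility",
    "reputation": "Credibility",
    "timeliness": "Currentness",
    "currency": "Currentness",
    "usefulness": "Usability",
    "appropriateness": "Relevancy",
    "readability": "Understandability",
    "interpretability": "Understandability",
    "ease of understanding": "Understandability",
    "efficient use of memory": "Efficiency",
    "confidentiality": "Privacy",
    "access security": "Privacy",
    "format precision": "Precision",
}


def to_chosen_term(term: str) -> str:
    """Return the chosen term for a source term.

    Terms without a listed synonym are returned unchanged (whitespace trimmed).
    """
    cleaned = term.strip()
    return CHOSEN_TERM.get(cleaned.lower(), cleaned)


for source_term in ["Timeliness", "Access security", "Accuracy"]:
    print(source_term, "->", to_chosen_term(source_term))
```

A lookup of this kind only normalises vocabulary. It does not decide whether two source lists actually mean the same thing by a term, which still requires the definitions.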

Each characteristic in Table 20 requires a definition before its fitness for UGC can be evaluated. Table 21 presents existing definitions for the characteristics shown in Table 20.

Table 21. Data quality characteristic definitions

Data quality characteristic | Definition

Accuracy | The closeness between data values v and v0, where v0 is the correct representation of what the data value v aims to represent. Based on syntactic and semantic accuracy (Batini and Scannapieco, 2006).

5.1 Data and information quality 75

Syntactic accuracy | The closeness of words in the text to a reference vocabulary. $K$ is the number of words, $w_i$ is a word in the text, and $V$ is the vocabulary used in the text (Batini and Scannapieco, 2016).

\[ \text{syntactic acc} = \frac{\sum_{i=1}^{K} \mathrm{closeness}(w_i, V)}{K} \]

Semantic accuracy | How correctly the meaning of values represents real-world facts. An object identification problem where $\langle \alpha, \beta \rangle$ is a pair of tuples to be matched, $M$ is the set of matching pairs, $U$ is the set of non-matching pairs, $x$ is a random vector of $n$ attributes, and $p()$ is the probability of matching (Batini and Scannapieco, 2006; Elmagarmid, Ipeirotis and Verykios, 2007).

\[
\langle \alpha, \beta \rangle \in
\begin{cases}
M & \text{if } p(M \mid x) \ge p(U \mid x) \\
U & \text{otherwise}
\end{cases}
\]

Completeness | Completeness of a tuple with respect to the values of all its fields, where $T_v$ is the number of null values in a tuple and $N_v$ is the total number of values in a tuple (Batini and Scannapieco, 2006; Blake and Mangiameli, 2011).

\[ \mathit{completeness} = 1 - \frac{T_v}{N_v} \]

Consistency | Violation of semantic rules defined over (a set of) data items, where items can be tuples of relational tables or records in a file. $g$ is the data value, and $N$ is the number of rules for $g$ (Heinrich et al., 2018).

\[
r_n(g) =
\begin{cases}
0 & \text{if } g \text{ fulfills rule } r_n \\
1 & \text{else}
\end{cases}
; \qquad
\mathit{cons}(g) = 1 - \frac{\sum_{n=1}^{N} r_n(g)}{N}
\]

Credibility | How data are accepted or regarded as true, real, and credible, where $\mathit{dist}$ is the distance $d(s, e)$ between the sensor $s$ and entity $e$, and $d_{max}$ is the maximum acceptable distance (Firmani et al., 2016).

\[
\mathit{credibility} =
\begin{cases}
1 - \dfrac{\mathit{dist}}{d_{max}} & \text{if } d(s, e) < d_{max} \\
0 & \text{otherwise}
\end{cases}
\]

Currentness | Currentness concerns how promptly data are updated with respect to changes occurring in the real world (Batini and Scannapieco, 2006).

\[ \mathit{currentness} = \mathit{Age} + (\mathit{DeliveryTime} - \mathit{InputTime}) \]

Accessibility | The ability of the user to access the data from his or her own culture, physical status/functions, and available technologies (Batini and Scannapieco, 2006).

Usability | A collection of other characteristics characterized by usability aspects, verifiability, imperfection, and integration (Cross and Joana, 2010).

\[ \mathit{usability} = \mathrm{avg}(\mathit{accuracy} + \mathit{credibility} + \mathit{completeness} + \mathit{currentness} + \mathit{relevance} + \mathit{granularity} + \mathit{accessibility}) \]

Relevancy | The extent to which data are applicable and helpful for the task at hand. $n$ is the number of words in a sentence, $m$ is the number of characters in a word, and $\mathit{WordSimilarity}$ is the similarity between two words, between 0 and 1 (Yang, Feng and Fabbrizio, 2006).

\[ \mathit{SentenceSimilarity}(Q, Q) = \frac{1}{n} \sum_{1 \le j \le n} \left( \max_{1 \le i \le m} \mathit{WordSimilarity}(w_j, w_i) \right) \]

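Understandability | The ease with which data can be comprehended without ambiguity and be used by a human information consumer (Dexun et al., 2014).

\[
\begin{aligned}
\mathit{understandability} = {} & -0.33 \cdot \mathit{Abstraction} + 0.33 \cdot \mathit{Encapsulation} + 0.33 \cdot \mathit{Coupling} + 0.33 \cdot \mathit{Cohesion} \\
& - 0.33 \cdot \mathit{Polymorphism} - 0.33 \cdot \mathit{Complexity} - 0.33 \cdot \mathit{DesignSize}
\end{aligned}
\]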

Redundancy | Minimality, compactness, and conciseness refer to the capability of representing the aspects of the reality of interest with minimal use of informative resources (Batini and Scannapieco, 2016).

Efficiency | The degree to which data has attributes that can be processed and which provide the expected levels of performance by using the appropriate amounts and types of resources in a specific context of use (ISO, 2008).

\[ \text{Query performance} = \frac{\log(\mathit{row\_count})}{\log\left( \dfrac{\mathit{index\_block\_length}}{3} \cdot \dfrac{2}{\mathit{index\_length} + \mathit{data\_pointer\_length}} \right)} + 1 \]

Representational consistency | Coherence of physical instances of data with their formats (Redman, 1996).

Privacy | Data is hidden or concealed from others. $S$ is the sensitivity of a data item, $V$ is the visibility in a given context, and $R$ is relatedness. $a$, $b$ and $c$ are real numbers (Senarath, Grobler and Arachchilage, 2019).

\[ \mathit{PrivacyRisk}(i, j) = \frac{S_i^{a} \times V(i, j)^{b}}{R(i, j)^{c}} \]

Portability | Degree of effectiveness and efficiency with which a system, product, or component can be transferred from one hardware, software, or other operational or usage environment to another (ISO, 2008).

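Precision | Precision refers to the amount of detail that can be discerned in space, time, or theme. Measured using the Levenshtein edit distance, where $a$ and $b$ are the given values and $i$ and $j$ are the indexes (Levenshtein, 1965; Elmagarmid, Ipeirotis and Verykios, 2007).

\[
\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i - 1, j) + 1 \\
\mathrm{lev}_{a,b}(i, j - 1) + 1 \\
\mathrm{lev}_{a,b}(i - 1, j - 1) + 1_{(a_i \neq b_j)}
\end{cases}
& \text{otherwise}
\end{cases}
\]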

Compliance | Defining and evaluating the compliance between data and schemas is a measure of the relationship (similarity, relatedness, distance, etc.) between two entities. $a$ and $b$ are values of elements in minimum distance, and $\bar{a}$ and $\bar{b}$ are means of all elements (Hulitt and Vaughn, 2010).

\[ \text{compliance (degree of variance)} = \frac{\sum (a - \bar{a})(b - \bar{b})}{\sqrt{\sum (a - \bar{a})^2 \sum (b - \bar{b})^2}} \]

Traceability | The extent to which data are well documented, verifiable, and easily attributed to a source. $R$ is a source, $\Omega$ is a set of $R$, $E(\Omega)$ is a measure of uncertainty, and $\lambda$ is the number of reports (Lu et al., 2019).

\[ \text{Network traceability entropy (NTE):} \quad E_{\lambda} = \sum_{\Omega : |\Omega| = \lambda} \frac{E(\Omega)}{\binom{|R|}{\lambda}} \]

Availability | Property of being accessible and usable upon demand by an authorized entity (ISO, 2008).

\[ \mathit{Availability} = \frac{\mathit{Runtime}}{\text{Total operative time}} \]

Recoverability | The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use (ISO, 2008).

Ability to represent null values | Ability to distinguish neatly (without ambiguities) null and default values from applicable values of the domain (Redman, 1996).

Format flexibility | Changes in user needs and recording medium can be easily accommodated (Redman, 1996).

Objectivity | Data is unbiased and impartial, where $E$ is evidence, $H$ is a hypothesis (assumed value), and $p()$ denotes the probability (Reiss and Sprenger, 2017).

\[ w(E, H, H) = \log \frac{p(E \mid H)}{p(E \mid H)} \]

Value | The extent to which data are beneficial and provide advantages from their use (Wrabetz, 2017).

\[ \mathit{DataValue}(t) \ge (\mathit{GatherCost} + \mathit{MaintainCost} + \mathit{AccessCost}) / \mathrm{GB} / \mathrm{yr} \ast \mathit{RetentionPeriod} \]

Volume | Appropriate amount of data: the extent to which the quantity or volume of available data is appropriate. Sample size formula where z is the z-score, e is the margin of error, p is the standard deviation, and N is the population size (Krejcie and Morgan, 1970).

sample size = N ∗

Concise representation | The extent to which data are compactly represented without being overwhelming (Wang and Strong, 1996).
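Several of the measures in Table 21 reduce to simple ratios. The following Python sketch illustrates completeness, consistency, and credibility under stated assumptions: a record is a dictionary, `None` stands for a null value, and each semantic rule is a predicate that returns `True` when it is fulfilled. The function names and the sample post are hypothetical and are not taken from the cited sources.

```python
from typing import Any, Callable, Mapping, Sequence


def completeness(record: Mapping[str, Any]) -> float:
    """1 - Tv / Nv, with Tv the number of nulls and Nv the number of values."""
    if not record:
        raise ValueError("record must contain at least one field")
    nulls = sum(1 for value in record.values() if value is None)
    return 1 - nulls / len(record)


def consistency(value: Any, rules: Sequence[Callable[[Any], bool]]) -> float:
    """1 - (violated rules) / N; a rule returns True when it is fulfilled."""
    if not rules:
        raise ValueError("at least one rule is required")
    violated = sum(0 if rule(value) else 1 for rule in rules)
    return 1 - violated / len(rules)


def credibility(dist: float, d_max: float) -> float:
    """1 - dist / d_max within the acceptable distance, 0 otherwise."""
    return 1 - dist / d_max if dist < d_max else 0.0


# Hypothetical user-generated post with one missing field.
post = {"author": "anna", "text": "Road closed", "location": None, "time": "2021-05-01"}
# Assumed rules: the text is non-empty and at most 280 characters long.
rules = [lambda text: len(text) > 0, lambda text: len(text) <= 280]

print(completeness(post))                   # 0.75
print(consistency(post["text"], rules))     # 1.0
print(credibility(dist=25.0, d_max=100.0))  # 0.75
```

All three functions return a value between 0 and 1 for valid input, which makes the scores comparable across characteristics. What counts as a null value or a rule is a domain decision that the formulas leave open.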

Data and information require different definitions for characteristics because they are two separate concepts (Davenport and Prusak, 2000; Bovee, Srivastava and Mak, 2003; Batini and Scannapieco, 2016). For example, a list of temperatures is data, but the data becomes information when given a context, such as predicted temperatures for the upcoming week. Data quality characteristics are essential regardless of the purpose of the data, whereas information quality characteristics need to consider the purpose. Some characteristics are important in both data and information quality. Table 22 presents the definitions for the same characteristics from Table 20 regarding information.

Table 22. Information quality characteristic definitions

Information quality characteristic | Definition
Accuracy | Information is correct and free of errors (Wang and Strong, 1996).
Syntactic accuracy | The closeness of words in the text to a reference vocabulary. K is the number of words, wi is a word in the text, and V is the vocabulary used in the text (Batini and Scannapieco, 2016).
Semantic accuracy | How correctly the information represents the real-world facts (Batini and Scannapieco, 2016).
Completeness | Information is of sufficient depth and scope for the task at hand (Wang and Strong, 1996).
Consistency | Degree of similarity between perceived information (IGI Global, 2021).
Credibility | Information is accepted as real and comes from a trusted source (Wang and Strong, 1996).
Currentness | How promptly information is updated with respect to changes occurring in the real world (Batini and Scannapieco, 2006).
Accessibility | Access to information is restricted (Wang and Strong, 1996).
Usability | Based on specified usability aspects. Related to the advantage the user gains from the use of the information (Batini and Scannapieco, 2016).
Relevancy | Information is applicable and useful for the task at hand (Wang and Strong, 1996).
Understandability | The information is easily comprehended and without ambiguity (Wang and Strong, 1996).
Redundancy | The capability to represent the aspects of the reality of interest with minimal use of informative resources (Batini and Scannapieco, 2016).
Efficiency | Not applicable to information
Representational consistency | Information is identically represented.
Privacy | Personal information is hidden.
Portability | Not applicable to information
Precision | The amount of detail in space, time, or theme within the given information (Elmagarmid, Ipeirotis and Verykios, 2007).
Compliance | Information follows given rules and regulations within the context of use.
Traceability | How well the information is documented and attributed to a source.
Availability | Information is available or easily and quickly retrieved (ISO, 2008).
Recoverability | Not applicable to information
Ability to represent null values | Not applicable to information
Format flexibility | Information can be presented in a variety of formats.
Objectivity | Information is unbiased and impartial (Wang and Strong, 1996).
Value | The extent to which information is beneficial and provides advantages from its use (Wrabetz, 2017).
Volume | The quantity of information is appropriate (Wang and Strong, 1996).
Concise representation | The extent to which information is compactly represented without being overwhelming (Wang and Strong, 1996).
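Syntactic accuracy is defined in both Table 21 and Table 22 as the closeness of the words in a text to a reference vocabulary, but the closeness function itself is left open. The following Python sketch is one possible reading, not the cited authors' method: it assumes that closeness is one minus the Levenshtein distance to the nearest vocabulary word, normalised by the longer word length. The function names and the sample words are hypothetical.

```python
from typing import Collection, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def closeness(word: str, vocabulary: Collection[str]) -> float:
    """Assumed closeness: 1 - normalised distance to the nearest vocabulary word."""
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    nearest = min(
        levenshtein(word, entry) / max(len(word), len(entry), 1)
        for entry in vocabulary
    )
    return 1 - nearest


def syntactic_accuracy(words: Sequence[str], vocabulary: Collection[str]) -> float:
    """Average closeness of the K words in a text to the vocabulary V."""
    if not words:
        raise ValueError("text must contain at least one word")
    return sum(closeness(word, vocabulary) for word in words) / len(words)


# Hypothetical two-word post with one misspelling.
vocabulary = ["road", "closed"]
print(levenshtein("kitten", "sitting"))                   # 3
print(syntactic_accuracy(["road", "closd"], vocabulary))  # about 0.917
```

Normalising by word length keeps each closeness value between 0 and 1, so a single long misspelled word does not dominate the average. Other closeness functions, such as exact vocabulary membership, would fit the same definition.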

5.1.2 Data and information quality characteristics for user-generated content