Identification of Influence Maximizers in Students’ Social Networks

Hafiz Mohsin Abdul Rashid

Master's thesis

University of Eastern Finland School of Computing

Computer Science March 2021

UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu School of Computing

Computer Science

Hafiz Mohsin Abdul Rashid: Identification of Influence Maximizers in Students’ Social Networks

Master’s Thesis, 77 p., 2 appendices (3 p.)

Supervisor of the Master’s Thesis: Dr. Mohammed Saqr (Senior Researcher)

March 2021

Abstract: Social network analysis has been used to study groups, communities, and network dynamics. Previous studies have also identified leaders and active spreaders in networks using network and node characteristics. However, these studies focused on a single type of social network rather than multiple types, and they were limited to a specific scope, for example descriptive analysis or the study of some network dynamics using LMS data. A recently published dataset containing several types of networks, built from the daily communications of the same users over different media, has motivated the analysis of multiple types of social networks.

To study diffusion and influence maximization, previous studies have used one type of social network instead of multiple types. To address this, this study identifies the influence and diffusion capability of students to highlight their importance and role in the different types of social networks. To find the influence maximizers in students' social networks of different types, the current study presents (1) a comparison of network dynamics and metrics across the networks, (2) the identification of factors behind the formation of the networks using probabilistic and inferential models, and (3) the identification of influential nodes using network heuristics. The study uses four different types of social network data from the same users: Calls, SMS, Facebook, and Proximity networks. The results show that the Proximity and Facebook networks have more interaction, rich clubs, and better ties among the participants. In the comparison of network measures, the actual measures are most of the time higher than the average statistics of randomly generated networks. The generative factor is found to be mutuality for the calls and messaging networks, and shared partners for the Facebook and Proximity networks. Finally, influential nodes are identified using important node-level centrality measures. Furthermore, the cluster comparison shows that most of the measures play a positive role in the identification of influence maximizers.

Keywords: Social Network, Social Network Analysis, Proximity Data, Exponential Random Graph Models, Influence Maximization, Diffusion

Acknowledgment

I would like to express my humble thanks to Allah Almighty, who blessed me with the strength to accomplish this work. I am thankful to His last and beloved Holy Prophet MUHAMMAD (PBUH), who is forever a source of knowledge and a role model for everyone. I want to express my deepest gratitude to my mother for her prayers, love, and support.

I am thankful to the University of Eastern Finland and the Department of Computing for providing me the opportunity to express my skills and capability. I want to express a deep sense of obligation to my honorable research supervisor Dr. Mohammed Saqr, whose helping attitude, research experience, intellectual ideas, and suggestions motivated me to complete this work under his supervision. I am also thankful to the authors who published the Copenhagen Networks Study interaction data and answered my queries during this study. I also want to extend my gratitude to Prof. Markku Hauta-Kasari and Dr. Oili Kohonen for providing an opportunity and scholarship to pursue my master’s degree at one of the finest educational institutions in Finland. Lastly, I am thankful to my family, friends, and teachers for their help and love.

Joensuu, 19 March 2021 Hafiz Mohsin Abdul Rashid

List of abbreviations

SNA     Social Network Analysis
CNS     Copenhagen Networks Study
SIR     Susceptible Infected Recovered
IC      Independent Cascade
FB      Facebook
BT      Bluetooth
PR      Proximity
RSSI    Received Signal Strength Indication
dBm     Decibel-milliwatts
API     Application Programming Interface
SDH     Single Discounted Heuristic
MDH     Maximum Degree Heuristic
DDH     Degree Discounted Heuristic
UK      United Kingdom
USA     United States of America
ERGM    Exponential Random Graph Model
LPM     Link Probability Model
GWESP   Geometrically Weighted Edgewise Shared Partners

Table of Contents

1 Introduction
1.1 Objective
1.2 Outline
2 Background Work
2.1 Social Networks Study
2.2 Diffusion and Influence Maximization
2.3 Methods for Studying Diffusion
2.3.1 The Susceptible-Infected-Recovered (SIR) Model
2.3.2 The Tipping Model
2.3.3 The Independent Cascade Model
2.3.4 The Linear Threshold Model
2.4 Heuristics
2.5 Practical Implications in the Literature
3 Study Context and Dataset Description
3.1 Study Context
3.2 Dataset Description
3.2.1 Calls Network
3.2.2 SMS Network
3.2.3 Facebook Network
3.2.4 Proximity Network
4 Methodology
4.1 Types of Networks
4.1.1 Directed Networks
4.1.2 Undirected Networks
4.2 Data Analysis and Cleansing
4.2.1 Bluetooth Interactions and Proximity Data
4.2.2 RSSI and Thresholding
4.2.3 Time Aggregation and Friendship Criteria
4.3 Network Level Analysis
4.3.1 Network Level Measure
4.3.2 Network Generative Factors using ERGM
4.4 Node Level Analysis
4.4.1 Node Level Measures
4.4.2 Identification of Influence Maximizers
5 Results
5.1 Networks Description and Comparison
5.2 Networks Formation and Generative Factors
5.3 Influential Students and Cluster Comparison
6 Discussion and Conclusion
References

Appendices

Appendix 1: Cluster Statistics (1 page)

Appendix 2: ERGM Detailed Statistics (2 pages)

1 INTRODUCTION

In the last few decades, technology has revolutionized many fields of life, including medical science, social science, marketing, and education. Technology has also introduced tools and methods to analyze big data and complex networks. The United Nations survey on big data found that the volume of data is increasing exponentially and that 90 percent of the current volume of data was produced during the last two years (United Nations, 2016). The World Economic Forum has published a report on the data produced daily by different sources; the report states that by 2025, 463 exabytes of data will be produced each day (Desjardins, 2019).

Table 1. Data produced daily by social interaction and internet search

Source      Data
Twitter     500 million tweets
Facebook    65 billion newsfeeds
WhatsApp    4 petabytes messages
Emails      294 billion emails

This massive growth in social media data complicates the extraction of useful information from it. At the same time, research aimed at drawing useful conclusions and supporting decision-making has become easier thanks to the flow of data provided by social networks (Matilda, 2017). In social networks, interactions between people are represented by edges between the nodes of the network (Mislove et al., 2007). If participants interact in only one network, it is a single-type network; if the same participants interact in different networks at the same time, they form multiple types of social networks (Sapountzi & Psannis, 2018).

Every node has characteristics and values that represent information about the node and its behavior (Sapountzi & Psannis, 2018). Using these characteristics, social networks can be represented and visualized graphically. Graphical representations reduce the difficulty of gaining insight into the data and make the problem easier to understand (Gilkey, 2019). In addition to visualizations, a large number of mathematical graph measures can provide both network-level and node-level information (Arif, 2015).

Network science and complex networks are also becoming relevant for modeling complex data structures and their interactions (Csányi & Szendrői, 2004). Network science is the field that studies complex networks, including telecommunication networks, academic networks, biological networks, semantic networks, and social networks. The elements or objects in these networks are represented by nodes, and the relationships between them are represented by edges (“Network Science”, 2020).

Network science draws on many academic fields. Its graph theory concepts come from mathematics. For visualization and data mining it relies on computer science, and for probabilistic or inferential modeling it uses statistics. It also draws on sociology to understand social structure and interactions. In this way, multiple disciplines come together under the umbrella of network science to address the analytical study of complex networks (“Network Science”, 2020). Social network analysis is a technique used to study the social structure, interactions, and other properties of a network. Nowadays, social network analysis is a key technique for data analysis (Sapountzi & Psannis, 2018).

The characteristics of nodes in social networks also propagate within the network from one node to another according to some criteria. This process of propagation is known as diffusion (Shakarian et al., 2015). Diffusion is a very important activity in network analysis because it describes the process of spreading. This can be the spread of knowledge among students and teachers in an educational social network (Saqr & Viberg, 2020), the spread of disease among patients in a medical social network (Stattner & Vidot, 2011), or the spread of promotional news among customers from a marketing perspective (Doyle, 2007; Tayeh & Mustafa, 2018).

There are multiple diffusion models that describe how information spreads based on individual characteristics and the collective probabilities of neighboring nodes (Shakarian et al., 2015). For a given network, these models explain how the spreaders propagate the information. The popular diffusion models are the Independent Cascade Model and the Linear Threshold Model, originally derived from the Susceptible-Infected-Recovered (SIR) Model and the Tipping Model, respectively (Shakarian et al., 2015). Variations of these models are available in the literature, depending on the problem and the network properties.

SNA and diffusion can be useful in academic environments to evaluate collaboration during an academic activity or course (Saqr et al., 2018). Diffusion is also useful for analyzing disease transmission from one patient to another. In the current pandemic, studying the spread of diffusion and identifying influence maximizers can help in responding effectively. SNA and diffusion can also help to form a network that maximizes the sales of a product. Previous research also helps in choosing the best social networking channel for building a strong network for a product's publicity campaign (Doyle, 2007). Similarly, they can be used for fraud detection (Omar et al., 2014), risk management (Ding et al., 2014), and analyzing traffic layout (El-adaway, 2014).

Previous work has placed considerable emphasis on network analysis, comparison, and the generative factors behind the formation of ties (Saqr & Montero, 2020). It has also evaluated collaboration and social behavior among students, and SNA has been used for performance prediction (Saqr et al., 2017). Influential nodes within a network have also been identified in previous studies (Rossi et al., 2018; Saqr & Viberg, 2020).

The most influential spreaders can spread knowledge, help embrace a plan, follow a script, or foster cooperation with other students in a classroom, which ultimately improves overall learning. Network properties calculated with the help of SNA carry a lot of information that can help to uncover interesting results correlated with external characteristics of a learning environment (Wise & Cui, 2018). For example, students who posted more content got higher grades.

Despite all this research, there is still a need to combine important social network concepts to highlight the overall impact and importance of studying students' social networks in order to improve different learning environments. By combining different social network concepts within one comprehensive study, this study provides, in addition to important network statistics and visual structure, a guide for improving students' learning activities by addressing questions such as: How do talented and active students collaborate (Saqr et al., 2020)? How can diffusion be used to maximize knowledge construction among students (Saqr & Viberg, 2020)? How can the social interactions of students help to predict and improve their academic performance (Saqr et al., 2017)? Which medium of communication is popular among students (Lampinen, 2016)?

Scientists have conducted social network studies using users' mobile data, applying different social network measures in the analysis of mobile and telecom data. Baruah and Angelov (2012) used mobile call data to perform SNA and find key individuals, similar communication patterns, the significance of participants, and the evolution of the network using evolving clustering. Al-Molhem et al. (2019) studied telecom data using centrality measures to identify influencers and found a 30% increase in the growth of mobile traffic when these influencers were used in a marketing campaign.

Similarly, Zigkolis et al. (2009) and Zignani et al. (2015) used mobile data to analyze the interactions, relationships, and behavior of participants. All these researchers focused on one type of network and on a few centrality measures instead of studying multiple network concepts at a deeper level. There is also a need for visualization and classification of all users in mobile networks according to their influential power, as done by Saqr and Viberg (2020) using students' online discussion forum data.

Online social networking platforms like Facebook, Twitter, and Orkut are known by 98% of people (Jothi et al., 2011). Scientists have evaluated the awareness and popularity of different social networking sites and used a survey to evaluate the impact of social media on brand communication through ads (Jothi et al., 2011). However, comparing social media networks with other networks, such as mobile call and proximity networks, can help to show which of several communication networks is most popular. Himelboim (2017) explained different concepts of SNA using social media, identified the influential nodes in social media networks like Facebook, and analyzed their impact on information spread.

A study conducted in South Africa evaluated the impact of Twitter trends of critical personalities on public health. The results showed that influencers on Twitter, especially government personalities, have a great impact on citizens in the context of the announcement of a controversial health care bill (Struweg, 2020). Del Fresno Garcia et al. (2016) identified the influential nodes on the internet and classified users into three groups according to their influential power. In educational environments, scientists have identified the social capabilities, friendships, and social interactions of students using Facebook interaction and friendship data (Ugander et al., 2011; Catanese et al., 2012). Along with SNA, interactions, friendship ties, and the social structure of students, this study contributes by highlighting the influencers and their classification to help students collaborate in learning environments.

Location-based networks, that is, proximity networks, are also helpful for understanding the social interactions of participants. The analysis of proximity data is challenging and complex (Finnerty et al., 2014), as separating actual interactions from spurious ones requires multiple techniques and criteria for data cleansing. Previous researchers have provided multiple frameworks for identifying actual friendships in proximity networks (Sekara & Lehmann, 2014). SNA has further helped to understand the complex structure of proximity networks formed by Bluetooth interactions (Simoski et al., 2020). These authors highlighted important nodes based on degree centrality only (Simoski et al., 2020). This study adds multiple heuristics for identifying influential nodes, in order to obtain more accurate results and form better, more collaborative groups of students in the learning environment.

All of the previous studies focused on a social network of one type or on a comparison of two networks, e.g. mobile data (Zigkolis et al., 2009; Baruah & Angelov, 2012; Zignani et al., 2015; Al-Molhem et al., 2019), social media data (Jothi et al., 2011; Himelboim, 2017; Struweg, 2020), or proximity data (Sekara & Lehmann, 2014; Simoski et al., 2020). LMS data has been used to form social and learning networks in order to evaluate the academic and social behavior of students (Saqr & Montero, 2020). However, students may hesitate to use such formal platforms for their normal day-to-day discussions (Soda & Zaheer, 2012; Dokuka et al., 2020). People do not behave in the same way across different communication networks (Karikoski & Nelimarkka, 2011; Kostakos & Venkatanathan, 2010).

It is not a good idea to rely on the results of a single dataset when trying to understand the social behavior of people (Karikoski & Nelimarkka, 2011). Similarly, it has been observed that evaluating and studying multiple social networks gives a better conceptualization of interactions and social ties (Lampinen, 2016; Lim et al., 2015). Studying a single social media site or a single social network is like a blind man describing an elephant (Ellison, 2012). Comparing multiple social networks makes it easier to understand the social behavior of students in different social paradigms (Ugander et al., 2011; Lampinen, 2016).

Considering the need to study multiple social networks of the same users, Lim et al. (2015) used multiple online social networks to analyze users' actions across them. They compared and contrasted the behavior of users in various social networks to see whether it is identical. The researchers wanted to know why people use different social networks if their behavior is the same across all of them. The findings showed that people behave differently in different contexts across various social networks. Similarly, Viswam and Darsan (2016) used different social networks to find users who were similar based on their profile information. The findings indicate that social media attributes, in addition to user profile attributes, can be used to identify users. Recognizing the similarities and differences in operation, actions, and user identity therefore requires the use of multiple social networks (Lim et al., 2015; Viswam & Darsan, 2016).

Researchers have used several methods to identify the positions of nodes in various social networks and contexts. Huang et al. (2014) identified the role of a node in a social network using key attributes of the node; their results show that the proposed method for deciding the role of a node performs well. Rabbany et al. (2014) identified the roles of students in an online discussion forum from students’ interactions, with the help of a toolbox that uses social network metrics for ranking. They also claimed that the toolbox would support a fairer assessment of students based on their positions in online courses. Saqr and Viberg (2020) also defined the roles of students in the learning environment and divided them into three categories: leaders, arbitrators, and satellites. Similarly, Havakhor et al. (2018) discovered three types of information spreaders across social media networks: seekers, contributors, and brokers.

However, there is still a need to examine various student social networks in the context of educational and learning environments. Identifying similarities, differences, and students' roles across multiple networks is required to understand the behavioral, social, and influential patterns of students. Huang et al. (2014) found that, to obtain better results in role recognition, it is also important to study multiple social networks and to compare roles across them. Because different networks have different topologies and properties, roles, behavior, and social characteristics can differ. Studying multiple networks will therefore help to demonstrate the various roles and social characteristics of students in different networks.

To fill this gap and to understand the dominant social characteristics and behavioral patterns of students in different environments, this study uses multiple types of social networks that capture the daily interactions of the same students. Moreover, whereas previous research addresses these concepts separately, this study combines network analysis, comparison of network dynamics, identification of influential nodes, visualization, identification of rich clubs, and descriptive and generative analysis in a single study, using a dataset of multiple social networks for the same participants.

1.1 Objective

This thesis studies how network science and influence maximization can be used to extract salient network dynamics and interpret useful information from social networks through network analysis. Which network factors drive the formation of a social network? Who are the influence maximizers in different types of social networks? For this purpose, CNS data comprising Proximity, Calls, Messages, and Facebook friends’ interactions has been used.

This data was collected from over 700 university students who took part in the study conducted at the Technical University of Denmark. The dataset was published to encourage researchers to model social behavior and spreading processes and to measure modes of communication using multiple types of social networks (Sapiezynski et al., 2019).

The research questions addressed in this study are:

RQ1: What are the similarities and differences between the four networks of students? How do the network measures and properties differ from each other?

RQ2: What are the generative factors behind the formation of ties in these networks?

RQ3: How can network-based heuristics be used to identify influence maximizers in students' social networks?

1.2 Outline

In this thesis, different types of students' social networks are studied through social network analysis in order to find the influence maximizers in the networks. First, the study focuses on network properties, description, interpretation, and the extraction of interesting information from all networks, e.g. rich clubs. Second, it highlights the generative factors that play a major role in the formation of each network. Finally, it identifies the most influential participants in the networks.

Chapter 2 covers diffusion, influence maximization, diffusion methods, and the work done in the literature using SNA, diffusion, and influence maximization. Section 2.1 gives an overview of the study of social networks. Section 2.2 gives an overview of diffusion, influence maximization, and their connection. Section 2.3 explains the methods for studying diffusion, section 2.4 explains the heuristics for highlighting influence maximizers, and section 2.5 highlights the practical implications of diffusion and influence maximization.

Chapter 3 gives an overview of the study context in which the dataset was collected and a description of the dataset. The context of the dataset is described in section 3.1. Section 3.2 describes all datasets in detail, including the characteristics of the data and their use.

Chapter 4 presents the methodology used for the analysis. Section 4.1 presents the types of networks. Section 4.2 explains the data cleansing techniques used to obtain meaningful data for the analysis. Section 4.3 states the techniques used for social network analysis and statistical models. Section 4.4 describes the techniques used for the identification of influential nodes.

Chapter 5 discusses the results. Network-level analysis and results are discussed in section 5.1, which also explains the robustness of the networks and compares the rich clubs in all networks. Section 5.2 presents the generative factors. Section 5.3 presents the influential nodes using clusters and the validation of the clusters using ANOVA.

Finally, chapter 6 provides a discussion comparing the findings with previous studies in the literature, and a summary of the whole work in the form of a conclusion.

2 BACKGROUND WORK

This chapter gives an overview of the previous knowledge and methods used by researchers, in order to explain the concepts used in this study. Subchapter 2.1 describes the basics of social networks. Subchapter 2.2 introduces diffusion and the influence maximization problem. Subchapter 2.3 explains the diffusion models used in the literature to study influence propagation. Subchapter 2.4 explains the centrality measures and heuristics available in the literature and the reasons for choosing them. Finally, subchapter 2.5 explains how influence maximization, diffusion, and social networks have been used in different fields to address real-life problems.

2.1 Social Networks Study

A network consists of actors (nodes) together with interactions of a specific type (e.g. friendship) that connect these nodes (Borgatti & Halgin, 2011). Social network analysis is the technique used to analyze ties and interactions at the individual node level (micro-level), at the overall network level (macro-level), and the relationship between the two (Stokman, 2001). SNA is also used to understand community structure and group dynamics using network properties. It also highlights social behavior within networks and the transfer of content within communities and between participants.

Social network metrics play a major role in studying the dynamics of social networks. These metrics help to highlight the main aspects and characteristics of the network under study (Ghali et al., 2012). They include degree, centrality measures such as betweenness, density, and many other mathematical measures. They can be used to identify active actors, active knowledge spreaders, communities within the network, and collaboration between participants.
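As a small illustration of two of the metrics named above, the following Python sketch computes node degree and network density for an undirected network. The toy friendship network and the function names are hypothetical and are not taken from the thesis data or tooling; the density formula used is the standard one for simple undirected graphs, 2m / (n(n - 1)).

```python
# Hypothetical toy friendship network (undirected), stored as an adjacency dict.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}


def degree(g):
    """Node-level measure: number of ties attached to each node."""
    return {node: len(neighbours) for node, neighbours in g.items()}


def density(g):
    """Network-level measure: share of possible ties that actually exist."""
    n = len(g)
    if n < 2:
        return 0.0
    # Each undirected edge appears twice in the adjacency dict.
    m = sum(len(neighbours) for neighbours in g.values()) / 2
    return 2 * m / (n * (n - 1))


print(degree(graph))
print(density(graph))
```

Degree is a node-level (micro-level) measure and density a network-level (macro-level) one, which mirrors the two levels of analysis described in this section.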

Social networks are used to analyze discussion forum data in order to understand the interactions between actors (Rabbany et al., 2014). They are also used to measure and model safety communication among staff. Alsamadani et al. (2013) presented such an approach for a small crew in the US, with future research aimed at evaluating multiple risks in this safety communication based on the results gathered from SNA. A concept for efficient management of a complex project using social networks is presented in Lee et al. (2018).

2.2 Diffusion and Influence Maximization

In social networks, characteristics and properties propagate from one node to another; this process of propagation is called diffusion through the social network (Shakarian et al., 2015). The propagation of a property is influenced by neighbors or friends in the social network (Banerjee et al., 2013). This influential power depends on the importance of the network nodes, which can be determined with the help of heuristics (Banerjee et al., 2013). Diffusion is helpful for spreading some property to achieve a specific goal (Sandhu et al., 2016; Son et al., 2015; Saqr & Montero, 2020; Bourne et al., 2017). Algorithms further help to optimize the diffusion process to ensure maximum spread (Nabi et al., 2017; Rossi et al., 2018). In this context, influence maximization is the optimization problem of maximizing diffusion in a social network.

Influence maximization provides an efficient solution for decision-making and planning in real-life problems. Its objective is to find a group of users in a network that can maximize the spread of influence across the network (Rossi et al., 2018). This technique helps to find the minimum number of people in a network with maximum influential power over the whole network. Influence maximization is therefore of great importance in network studies that aim to maximize diffusion. Formally, influence maximization can be defined as follows:

Influence maximization is an optimization problem: find a set of nodes of small size k in a network that can reach the largest possible portion of the network when spreading information. The set of such k spreaders has the maximum influential capability.

Input: A social network G(V, E)

Output: A set K of nodes (seeds) such that these nodes can propagate the property to the maximum number of nodes in G.
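In symbols, the definition above is commonly written as follows. This is a sketch under assumptions: the seed-set size k is treated as a given budget, and σ(K) is notation introduced here (it does not appear in the thesis) for the expected number of nodes reached when the diffusion starts from the seed set K under the chosen diffusion model.

```latex
K^{*} \;=\; \underset{K \subseteq V,\ |K| = k}{\arg\max}\ \sigma(K)
```

Under this reading, the choice of diffusion model (for example SIR or a threshold model) enters only through σ.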

For large-scale social networks, the identification of influential nodes is critical, time-consuming, and challenging work. Kempe et al. (2003) proved that it is an NP-hard problem. Researchers have been improving the performance of existing techniques and introducing new techniques and methods to address this problem more efficiently (Chen et al., 2009; Kundu et al., 2011; Rossi et al., 2018).

Diffusion is an essential concept for studying influence maximization (Caliò & Tagarelli, 2018). Influence maximization aims to maximize diffusion by selecting the top influential spreaders based on heuristics (Rossi et al., 2018). For this reason, the influence maximization problem falls within the scope of diffusion studies, and it can be studied using the same methods that are used to study the diffusion process (Shakarian et al., 2015).
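One of the simplest ways to select top spreaders with a heuristic is to rank nodes by degree and take the first k as seeds. The Python sketch below illustrates that idea on a hypothetical toy network; the function name and the tie-breaking rule are assumptions made for illustration, not the procedure used in this thesis.

```python
# Hypothetical toy network (undirected), stored as an adjacency dict.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}


def top_k_by_degree(g, k):
    """Return the k nodes with the highest degree as candidate seeds.

    Ties are broken alphabetically by node name (an arbitrary choice
    made here only to keep the result reproducible).
    """
    ranked = sorted(g, key=lambda node: (-len(g[node]), node))
    return ranked[:k]


seeds = top_k_by_degree(graph, 2)
print(seeds)
```

A pure degree ranking ignores overlap between the neighbourhoods of the chosen seeds, so two highly connected nodes that share most of their neighbours may add little beyond either one alone.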

2.3 Methods for Studying Diffusion

To achieve the goals of this study, an understanding of influence maximization and diffusion in social networks is necessary. Algorithms and models are available in the literature that describe the diffusion process in networks based on different characteristics. The most common are the SIR, tipping, independent cascade, and linear threshold models. The SIR model is explained in section 2.3.1, and the tipping, independent cascade, and linear threshold models are explained in sections 2.3.2, 2.3.3, and 2.3.4, respectively.

2.3.1 The Susceptible-Infected-Recovered (SIR) Model

The susceptible-infected-recovered (SIR) model is a classical model introduced to study the spread of disease (Yang et al., 2010). Each actor is in one of three states: susceptible, infected, or recovered. A susceptible node can be infected by infected nodes. An infected node can spread the disease to others. A recovered node can neither be infected nor infect others. Only the nodes that were infected in the previous time step can spread the disease (Shakarian et al., 2015), and they can infect only those neighbors that are in the susceptible state, with some probability β (Li et al., 2010). This diffusion and propagation capability can be calculated using different node-level measures (Shakarian et al., 2015). The flow is shown in Figure 1.

Figure 1. Flow Diagram of SIR Model

The numbers of susceptible, infected, and recovered nodes at time step t are denoted by S(t), I(t), and R(t), and s(t), i(t), and r(t) denote the corresponding fractions of nodes. At any given time, s(t) + i(t) + r(t) = 1 (Smith & Moore, 2004).
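The discrete-time behaviour described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the thesis: the function name `simulate_sir`, the adjacency-dictionary input, and the choice that a node recovers one step after infection are assumptions made for the example.

```python
import random


def simulate_sir(adjacency, seeds, beta, rng=None):
    """Discrete-time SIR on a graph given as {node: iterable of neighbors}.

    Nodes infected in the previous step try once to infect each susceptible
    neighbor with probability beta, then move to the recovered state.
    Returns the final recovered set and the (S, I, R) counts per time step.
    """
    rng = rng or random.Random()
    susceptible = set(adjacency) - set(seeds)
    infected = set(seeds)
    recovered = set()
    history = [(len(susceptible), len(infected), len(recovered))]
    while infected:
        newly_infected = set()
        for node in infected:
            for neighbor in adjacency[node]:
                if neighbor in susceptible and rng.random() < beta:
                    newly_infected.add(neighbor)
        susceptible -= newly_infected
        recovered |= infected
        infected = newly_infected
        history.append((len(susceptible), len(infected), len(recovered)))
    return recovered, history


# Hypothetical 4-node path graph used only for illustration.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reached, counts = simulate_sir(path, seeds=[0], beta=0.5, rng=random.Random(1))
```

Dividing each tuple in `counts` by the number of nodes gives s(t), i(t), and r(t), which always sum to 1.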

2.3.2 The Tipping Model

The tipping model is a popular diffusion model for studying the propagation of a node property in social networks. It is a deterministic linear threshold model and is also known as the target set selection or seed problem (Swaminathan, 2014). The term tipping is used in different contexts to mean that a small variation in a system brings about a large change (Pruitt et al., 2018).

The tipping model uses the concept of activation and thresholding functions (Shakarian et al., 2015).

The activation function θ gives the set of activated nodes at the initialization of the diffusion process and at every time step t. A node can pass information to its neighbors depending on its activation power. The literature offers different techniques for calculating activation power; for example, the activation power of a node can be calculated from the activation power of its neighbors and connected nodes (Zhang et al., 2016). The thresholding function checks whether a specific number of nodes in a node's neighborhood are active (Shakarian et al., 2015). If the count is more than the threshold, the node is activated; otherwise it stays inactive.
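A minimal sketch of the thresholding step is given below. It assumes a single integer threshold shared by all nodes and follows the strict "more than the threshold" reading of the text; the function name `tipping_model` and the example graph are hypothetical.

```python
def tipping_model(adjacency, seeds, threshold):
    """Deterministic tipping process on {node: iterable of neighbors}.

    In every round, an inactive node becomes active when the number of its
    active neighbors is strictly greater than `threshold`. The process stops
    when a round activates no new node.
    """
    active = set(seeds)
    while True:
        newly_active = {
            node
            for node in adjacency
            if node not in active
            and sum(1 for nb in adjacency[node] if nb in active) > threshold
        }
        if not newly_active:
            return active
        active |= newly_active


# Hypothetical graph: a triangle (0, 1, 2) with a tail 2-3-4.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
final_active = tipping_model(graph, seeds=[0, 1], threshold=1)
```

Replacing `>` with `>=` gives the "at least this many active neighbors" variant.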

2.3.3 The Independent Cascade Model

The independent cascade (IC) model is a generalized form of the SIR model. It differs from the SIR model in how the infection probability is defined (Zang et al., 2014). In the SIR model, a single infection probability applies to the whole network, whereas in the IC model every edge has its own infection probability, denoted by P(u,v) for the edge between u and v.

Infected nodes can spread the disease to their susceptible neighbors with the probability associated with the connecting edge. The remaining states and the time-step concept are the same as in the SIR model. So, by the definition of the independent cascade model, at each step t, a node v that was infected at time step t-1 infects an inactive neighbor u with probability P(u,v), whereas in the SIR model this probability is the same for every edge (Shakarian et al., 2015).
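The per-edge probability is the only change relative to SIR, as the following sketch shows. The function name `independent_cascade` and the dictionary layout, keyed by `(source, target)` pairs, are assumptions for illustration.

```python
import random


def independent_cascade(edge_prob, seeds, rng=None):
    """Independent cascade with one infection probability per directed edge.

    `edge_prob` maps (source, target) to the probability that `source`
    activates `target`. A node gets exactly one attempt per outgoing edge,
    in the step right after it became active.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = set(seeds)  # nodes activated in the previous step
    while frontier:
        next_frontier = set()
        for (source, target), probability in edge_prob.items():
            if source in frontier and target not in active:
                if rng.random() < probability:
                    next_frontier.add(target)
        active |= next_frontier
        frontier = next_frontier
    return active


# Hypothetical edge probabilities used only for illustration.
probabilities = {(0, 1): 0.9, (1, 2): 0.4, (2, 3): 0.1}
spread = independent_cascade(probabilities, seeds=[0], rng=random.Random(7))
```

Setting every value in `edge_prob` to the same β recovers the SIR behaviour described earlier.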

2.3.4 The Linear Threshold Model

The linear threshold model is a special case of the general thresholding model for studying diffusion in social networks. It is a weighted variant of the tipping model in which each edge (u,v) has a non-negative weight. It uses a linear threshold to activate an inactive node (Lim et al., 2015). The influence on a node is the sum of the influence weights of its active neighbors. For any node n in the network, the sum of incoming weights is less than 1 (Shakarian et al., 2015).

An active node influences an inactive node according to the weight of the edge between them, and an inactive node is influenced by all active nodes in its neighborhood at every time step. By the definition of the linear threshold model, every node v has a threshold θv drawn uniformly at random from the range [0,1] (Talukder et al., 2019). At each time step t, an inactive node v becomes active if the sum of the incoming weights from the nodes activated at time t − 1 or earlier is greater than θv. The threshold is selected randomly because of a lack of information about the node (Shakarian et al., 2015).
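The activation rule can be sketched as follows. The function name `linear_threshold` and the input layout (a dictionary of incoming edge weights per node) are assumptions; thresholds are drawn uniformly at random when none are supplied, and fixed thresholds can be passed to make a run reproducible.

```python
import random


def linear_threshold(in_weights, seeds, thresholds=None, rng=None):
    """Linear threshold diffusion.

    `in_weights[v]` maps each in-neighbor u of v to the weight of edge (u, v);
    every node must appear as a key. An inactive node v becomes active when
    the summed weight of its active in-neighbors is greater than its threshold.
    """
    rng = rng or random.Random()
    if thresholds is None:
        thresholds = {v: rng.random() for v in in_weights}
    active = set(seeds)
    while True:
        newly_active = set()
        for v, weights in in_weights.items():
            if v in active:
                continue
            total = sum(w for u, w in weights.items() if u in active)
            if total > thresholds[v]:
                newly_active.add(v)
        if not newly_active:
            return active
        active |= newly_active


# Hypothetical weights; each node's incoming weights sum to less than 1.
weights = {"a": {}, "b": {"a": 0.6}, "c": {"a": 0.3, "b": 0.3}, "d": {"c": 0.2}}
result = linear_threshold(weights, seeds=["a"], rng=random.Random(3))
```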

2.4 Heuristics

Finding predominant students in a social network is not trivial; it is a tricky and extensive task. The purpose is to select the minimum number of participants that can maximize diffusion within the network. The problem becomes more complex and difficult as the total population grows, and it has been proved to be NP-hard (Kempe et al., 2003). Models and algorithms for finding the seed set of influence maximizers run in nonlinear time because all combinations need to be checked (Goyal et al., 2011). The time taken by these algorithms also depends on the available computational resources (Li et al., 2019). Centrality measures are a common and effective way to find influence maximizers. Researchers keep trying to improve the running time by proposing new measures or by modifying existing ones (Alvarez-Hamelin et al., 2005; Lin et al., 2008; Hansen et al., 2011; Geum & Kim, 2020; Kundu et al., 2011; Morone et al., 2016; Simsek et al., 2020).

Heuristics are powerful tools for studying the diffusion of influence maximizers within the network (Chen et al., 2009). They were chosen after a literature review of the measures proposed in previous work. After analyzing different centrality measures and heuristic approaches, 11 measures were finalized based on their importance and nature. During selection, each candidate measure was required to differ from the measures already selected.

Degree Centrality

Degree centrality measures the incoming and outgoing connections of a node. A student with relatively more incoming and outgoing interactions increases activity in the network and spreads information with higher probability (Mochalova & Nanopoulos, 2013). It has been used in the literature to study diffusion in different social networks (Ayyappan et al., 2016). Because it focuses on quantity rather than quality, this measure may fail to identify the actual diffusion and influence maximizers in a social network (Hansen et al., 2011).

Betweenness Centrality

Betweenness centrality measures a node’s capability to lie between other nodes in a network. It is calculated as the fraction of shortest paths that pass through the target node (Perez & Germon, 2016; Jia et al., 2012). A node with high betweenness has more control over the network and can therefore play a primary role in speeding up diffusion in the social network (Hansen et al., 2011). For this reason, Dey et al. (2019) used it to find influence maximizers.

Closeness Centrality

Closeness centrality treats distance as the key factor in the centrality score and sums the distances from a node to every other node (Jia et al., 2012). A shorter total distance means a higher closeness centrality score (Saqr & Montero, 2020). It reflects how easily students can reach one another, which also makes diffusion easier in social networks (Mochalova & Nanopoulos, 2013). Researchers have also used it to find influence maximizers and found that the time complexity of closeness centrality, O(n^2), was better than that of betweenness centrality (Dey et al., 2019).

Eigen Centrality

Eigen centrality also considers the importance of the connected nodes (Jia et al., 2012): a node's score is built from the weighted scores of its neighbors. A higher score indicates links to strongly connected nodes, which increases the probability that information spreads over the network (Mochalova & Nanopoulos, 2013). Eigen centrality performs well in information propagation when it is used to identify the seed set for influence maximization (Dey et al., 2019). For this reason, Eigen centrality is included in the selected set of heuristics.

Cross Clique Centrality

Cross clique centrality counts the number of cliques to which a node belongs. It focuses on community-level or subgraph-level connectivity, which is important for finding influence maximizers (Kitsak et al., 2010). The diffusion or propagation rate in a social network has been found to increase as the number of cliques increases (Faghani et al., 2013). This measure has been used in the literature to study influence maximization and diffusion in different social networks (Faghani et al., 2013; Saqr & Viberg, 2020).

K-Core Decomposition

The maximal subgraph in which every vertex has degree at least k is called the k-core. The coreness of a vertex is k if it belongs to the k-core but not to the (k+1)-core (Alvarez-Hamelin et al., 2005), and the vertices with coreness k form the k-shell. K-core decomposition uses shell-level information instead of the whole network to compute a score. Based on the K-core score, the nodes that belong to the maximal K-core subgraph can be identified, and these can help to influence the whole network (Malliaros et al., 2016). The K-core score is a more accurate spreading predictor than degree and betweenness centralities (Malliaros et al., 2016). Researchers have also used it in their proposed model to reduce computation time, owing to its shell-level calculation (Dey et al., 2019).
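Coreness can be computed by repeatedly peeling off the vertex of lowest remaining degree. The sketch below uses that standard peeling idea on an undirected graph without self-loops; the function name `core_numbers` is an assumption, and the simple minimum search is chosen for readability rather than speed.

```python
def core_numbers(adjacency):
    """Coreness of every node in an undirected graph {node: set of neighbors}.

    Repeatedly removes a node of minimum remaining degree. The coreness of a
    removed node is the largest minimum degree seen up to that point.
    """
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core


# Hypothetical graph: triangle (0, 1, 2), pendant node 3, isolated node 4.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
coreness = core_numbers(graph)
```

Nodes that share the largest coreness value form the innermost core that the text refers to as the maximal K-core subgraph.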

Gravity Centrality

Gravity centrality applies the idea of orbits and the law of gravitation to node importance (Simsek et al., 2020). Earlier centrality measures consider only one aspect, either the neighborhood or the path, when calculating the influence score (Li et al., 2019). To include both, gravity centrality uses the K-shell values of two nodes and the distance between them (Li et al., 2019). By analogy with Newton's law of gravitation, it is calculated as

$$GC(u) = \sum_{v \in N(u)} \frac{ks(u)\, ks(v)}{d(u,v)^{2}}$$

In the above equation, ks(u) and ks(v) are the k-shell values of u and v respectively, and d(u,v) is the distance between u and v. According to the formula, nodes with a greater K-core score (neighborhood information) and a shorter distance to other nodes (path information) are more influential (Li et al., 2019). It is used in the literature to study influence maximization because it performed better than other heuristics such as degree, closeness, and betweenness centrality (Simsek et al., 2020).
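The formula can be evaluated directly once k-shell values and shortest-path distances are available. The sketch below is one possible reading: it takes N(u) to be all nodes within `radius` hops of u, which is an assumption (the text does not fix the neighborhood), and the helper names are hypothetical.

```python
from collections import deque


def core_numbers(adjacency):
    """k-shell value of every node, found by peeling minimum-degree nodes."""
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining, core, k = set(adjacency), {}, 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nb in adjacency[node]:
            if nb in remaining:
                degree[nb] -= 1
    return core


def distances_within(adjacency, source, radius):
    """Breadth-first hop distances from `source`, limited to `radius` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue
        for nb in adjacency[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[source]
    return dist


def gravity_centrality(adjacency, radius=2):
    """GC(u) = sum over v in N(u) of ks(u) * ks(v) / d(u, v)^2."""
    ks = core_numbers(adjacency)
    return {
        u: sum(ks[u] * ks[v] / d ** 2
               for v, d in distances_within(adjacency, u, radius).items())
        for u in adjacency
    }


# Hypothetical graph: triangle (0, 1, 2) with pendant node 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
scores = gravity_centrality(graph, radius=2)
```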

Maximum Neighborhood Component (MNC)

The maximum neighborhood component of a node is the size of the largest connected component in the subgraph formed by the node's neighbors. It uses the neighborhood aspect when calculating the centrality score. Nodes connected to more neighbors have greater MNC scores and can spread information to a large area of the network (Rossi et al., 2018). Lin et al. (2008) used it to find interaction hubs in networks and found that MNC performed well as the basis of their proposed model.
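A small sketch of the MNC score follows. It assumes the usual reading of the measure, namely the size of the largest connected component of the subgraph induced by a node's neighbors, with the node itself excluded; the function name is hypothetical.

```python
def maximum_neighborhood_component(adjacency, node):
    """Size of the largest connected component among the neighbors of `node`.

    `adjacency` is an undirected graph {node: set of neighbors} without
    self-loops. The search only ever follows edges between neighbors of
    `node`, because only members of `unseen` are expanded.
    """
    unseen = set(adjacency[node])
    best = 0
    while unseen:
        stack = [unseen.pop()]
        size = 1
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor in unseen:
                    unseen.remove(neighbor)
                    stack.append(neighbor)
                    size += 1
        best = max(best, size)
    return best


# Hypothetical graph: hub 0 with neighbors 1-4, where only 1 and 2 are linked.
graph = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
hub_score = maximum_neighborhood_component(graph, 0)
```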

Clustering Coefficient

The clustering coefficient (Liebig & Rao, 2014) is based on the concept of transitivity. It is a good measure for identifying important nodes in the network because it shows how well the neighbors of a node are connected to each other. If more neighbors are connected, the probability of propagation is higher. Liebig and Rao (2014) used this measure to identify influence maximizers in bipartite networks.
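For an undirected, unweighted graph, the local clustering coefficient is the fraction of neighbor pairs that are themselves connected. The sketch below shows that standard definition; it is not the bipartite variant studied by Liebig and Rao, and the function name is an assumption.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are directly linked.

    `adjacency` is an undirected graph {node: set of neighbors}. Nodes with
    fewer than two neighbors get a coefficient of 0.0.
    """
    neighbors = list(adjacency[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2 * links / (k * (k - 1))


# Hypothetical graph: hub 0 with neighbors 1-4, where only 1 and 2 are linked.
graph = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
hub_coefficient = local_clustering(graph, 0)
```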

Diffusion Degree

Diffusion degree considers collective diffusion instead of the score of the current node alone (Kundu et al., 2011; Banerjee et al., 2013). It captures the cumulative number of interactions, which helps to highlight influential nodes with better probability than simple degree centrality. For example, degree centrality gives only the incoming and outgoing interaction count of the current node, whereas diffusion degree considers the cumulative score of all connected nodes. Nodes with a higher diffusion degree have more connections in the network once cumulative connectivity is taken into account (Banerjee et al., 2013). Ultimately, the diffusion degree helps to evaluate the influence of participants. The literature reports good performance of diffusion degree compared with simple degree centralities when studying diffusion in the context of microfinance (Banerjee et al., 2013). Saqr and Viberg (2020) also used it to examine knowledge construction in CSCL with the help of diffusion.
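One common formulation adds a node's own probability-weighted degree to the probability-weighted degrees of its neighbors. The sketch below follows that formulation as an assumption about the intended definition; the function name and the per-node propagation probabilities `lam` are hypothetical.

```python
def diffusion_degree(adjacency, lam):
    """Diffusion degree of every node in an undirected graph.

    `adjacency` is {node: set of neighbors}; `lam[v]` is the assumed
    propagation probability of node v. The score of v is
    lam[v] * deg(v) plus the sum of lam[n] * deg(n) over neighbors n of v.
    """
    weighted = {v: lam[v] * len(adjacency[v]) for v in adjacency}
    return {
        v: weighted[v] + sum(weighted[n] for n in adjacency[v])
        for v in adjacency
    }


# Hypothetical graph and a uniform propagation probability of 0.5.
graph = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
scores = diffusion_degree(graph, {v: 0.5 for v in graph})
```

With this scoring, node 3 in the example has degree 1 but still receives credit for being attached to the hub, which is the cumulative effect the text describes.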

Collective Influence

Collective influence calculates the cumulative influence score of nodes. Nodes with higher collective influence scores highlight important leaders and influence maximizers in large-scale social networks (Morone et al., 2016). The authors also report a time complexity ranging from O(n log n) to O(n^2).
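The collective influence score of a node is usually written as (k_i − 1) multiplied by the sum of (k_j − 1) over the nodes j that lie exactly `radius` hops away, where k is the degree. The sketch below computes that quantity for a single node; the function name and default radius are assumptions, and the adaptive node-removal loop of the full algorithm is left out.

```python
from collections import deque


def collective_influence(adjacency, node, radius=2):
    """(deg(node) - 1) * sum of (deg(j) - 1) over nodes j at distance `radius`.

    `adjacency` is an undirected graph {node: iterable of neighbors}.
    """
    dist = {node: 0}
    queue = deque([node])
    boundary = []  # nodes exactly `radius` hops from `node`
    while queue:
        current = queue.popleft()
        if dist[current] == radius:
            boundary.append(current)
            continue
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    own = len(adjacency[node]) - 1
    return own * sum(len(adjacency[j]) - 1 for j in boundary)


# Hypothetical 5-node path graph 0-1-2-3-4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
centre_score = collective_influence(path, 2, radius=1)
```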

Random Heuristic

A random heuristic selects k nodes at random and checks the resulting diffusion in the network. It is not a very efficient technique because of its random nature, and it performs worse than other heuristics (Chen et al., 2009).

Maximum Degree Heuristics (MDH)

Maximum degree heuristics select the k nodes with the highest degree (Chen et al., 2009). More connections may indicate influential nodes, so nodes with the maximum degree may be influence spreaders (Mochalova & Nanopoulos, 2013). As with degree centrality, this heuristic may fail to highlight influence maximizers because it focuses on quantity rather than quality (Hansen et al., 2011).

Degree Discounted Heuristics (DDH)

Incoming and outgoing interactions increase activity and diffusion in the social network (Mochalova & Nanopoulos, 2013). The degree discounted heuristic is an alternative form of degree centrality proposed in previous work. The basic idea is not to count the edge vu toward the degree of v when its neighbor u has already been selected as a seed node (Chen et al., 2009). The degree discounted heuristic performed much better than classic degree and other centrality measures at identifying influence maximizers (Chen et al., 2009).

Single Degree Heuristics (SDH)

The single discount heuristic was proposed in previous work: the degree of each neighbor of a newly selected node is discounted by 1. It is an alternative form of DDH that uses the same idea with a discount of 1 unit in degree (Chen et al., 2009).
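The two discount heuristics can be compared side by side in a short sketch. The degree discount update used here, dd(v) = d(v) − 2t(v) − (d(v) − t(v))·t(v)·p with t(v) the number of already selected neighbors, is the form commonly attributed to Chen et al. (2009) for a uniform propagation probability p; the function names, the helper `build_adjacency`, and the example edges are assumptions for illustration.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def single_discount(adjacency, k):
    """Pick k seeds by degree, lowering each neighbor's degree by 1 per pick."""
    degree = {v: len(adjacency[v]) for v in adjacency}
    seeds = []
    for _ in range(min(k, len(adjacency))):
        best = max((v for v in adjacency if v not in seeds),
                   key=lambda v: degree[v])
        seeds.append(best)
        for neighbor in adjacency[best]:
            degree[neighbor] -= 1
    return seeds


def degree_discount(adjacency, k, p):
    """Pick k seeds using the degree discount rule with probability p."""
    degree = {v: len(adjacency[v]) for v in adjacency}
    discounted = dict(degree)
    selected_neighbors = {v: 0 for v in adjacency}
    seeds = []
    for _ in range(min(k, len(adjacency))):
        best = max((v for v in adjacency if v not in seeds),
                   key=lambda v: discounted[v])
        seeds.append(best)
        for v in adjacency[best]:
            if v in seeds:
                continue
            selected_neighbors[v] += 1
            t, d = selected_neighbors[v], degree[v]
            discounted[v] = d - 2 * t - (d - t) * t * p
    return seeds


# Hypothetical network: a dense group around nodes 0, 1, 2 and a separate hub 7.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 12),
         (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4), (2, 5),
         (7, 8), (7, 9), (7, 10), (7, 11)]
network = build_adjacency(edges)
seeds_single = single_discount(network, 3)
seeds_discount = degree_discount(network, 2, 0.01)
```

In this example, plain degree ranking would pick nodes 0, 1, and 2 from the same dense group, whereas both discount rules move on to the separate hub 7 once its competitors have lost degree to already selected neighbors.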

2.5 Practical Implications in the Literature

Influence maximization is an optimization problem to find a group of users in a network that can maximize the spread of some property across the network (Rossi et al., 2018). This technique helps to find the minimum number of participants in a network with maximum influential power over the whole network. Influence maximization is of great importance in network studies that aim to maximize diffusion. This section reviews the literature on SNA, diffusion, and influence maximization to highlight their importance and applications in different fields.

Influence, Knowledge Spread, and Educational Reforms

Saqr and Viberg (2020) classified students by their capability to spread knowledge in a computer-based collaborative learning environment and identified three groups of students: Leaders, Arbitrators, and Satellites. Network centrality measures have been used to understand social structure and interaction and to find active and inactive students, called leaders and facilitators respectively (Saqr & Alamro, 2019). Collaborative learning has also been analyzed and monitored using SNA techniques (Saqr et al., 2018).

Havakhor et al. (2016) identified knowledge spreaders and found three types: seekers, contributors, and brokers. They also found that the distribution of these roles has a great impact on the diffusion of knowledge in social media networks. For example, networks with more brokers produce more diffusion of knowledge than networks with more contributors and seekers. Different networks of the same social actors reveal interesting results when correlated with each other.

Dokuka et al. (2020) also found that behavioral diffusion, for example the spread of academic achievement, depends on the type of interaction, and that friendship interactions play a more positive role than instrumental interactions. The correlation of students' social and learning networks showed the importance of social ties in promoting learning ties (Saqr & Montero, 2020).

Peer learning is a popular way to improve the quality of education. Because peer learning is in demand and popular among graduate-level students (Mustafa, 2017), active and influential students can improve the quality of education by playing an important role in the peer learning environment. One of the challenges teachers face in creating a collaborative learning environment, however, is identifying such active students (Le et al., 2018). Bridging nodes, which are also nodes with more diffusion capability, take an active role and perform well in improving diffusion among different groups (Saxena et al., 2019). They can help weak and isolated students by exchanging ideas, expertise, and skills. Ultimately, this helps to improve students' critical skills and their attitude to lifelong learning (Mustafa, 2017).

Leading students can have an impact on their peers' learning process (Kim & Ketenci, 2019). Existing methods use moderators to improve learning in online learning environments (Aldaihani, 2020; Hadwin et al., 2018). However, it remains doubtful whether these moderators perform their work as expected. For that purpose, Kim et al. (2020) evaluated learning leadership. They first conceptualized the roles of learning leaders and classified them into full, transactional, and attractive facilitators. Second, they proposed a model for leader identification called LIM (Leader Identification Model) and found that it performed very well compared with previously available models. Similarly, previous research has identified the ‘stars’ and ‘neglected’ students using their social interactions (Sun et al., 2018).

To improve cooperative learning and teaching practices, researchers have also investigated the process of dynamic knowledge spreading and leaders' inspiration in cooperative learning environments (Li et al., 2017). They have proposed susceptible–infected–susceptible-leader (SISL) and susceptible–infected–removed-leader (SIRL) models and found that these models performed very well. They have also performed numerical analysis and simulation to identify the leader's role using dynamic transmission of knowledge (Li et al., 2017). Instead of using some models, previous studies have also used a degree centrality metric to identify the most influential students in the online learning environment using interactions of students from the discussion forum (Yudhoatmojo et al., 2017).

Influence can have positive as well as negative effects through friendship and peer interaction. Besides using the diffusion and influential power of students to promote collaborative learning, motivate weak students, and spread positive news in student networks, it can also be used to discourage harmful activities such as smoking, drinking, and criminal activity among students. Saxena et al. (2019) proposed an application for studying the influence phenomenon and the selection of friends based on the stochastic actor-based model and network science concepts. With the help of this application, the diffusion and popularity of bad habits such as smoking (Lakon et al., 2017), alcohol consumption (Wang et al., 2017), and junk food consumption (Montgomery et al., 2020) can be controlled beforehand. With the help of network properties, students who smoke or are addicted to alcohol can be identified and isolated to stop the spread and diffusion of such behavior through the whole network.

Research Trends and Leaders

Classification and segmentation of regions based upon their research progress and leadership can give an overall picture. To improve the individual research rank, (Lin et al., 2020) have used the influence and social network techniques on the research data along with the kano model and found that the USA, UK, and Canada are at the top with the author having maximum index value from the USA. They also concluded that article type is associated with the number of cited papers.

To highlight hot topics in the field of e-health, Son et al. (2015) provided a visualization of ongoing research topics using SNA. Similarly, e Fonseca et al. (2016) studied the network metrics of co-authorship networks in health research.

Disease Spread and Health Reforms

The spread of disease and infection is a key challenge in medical science. In the past few years, much research has gone into methodologies for controlling the spread of diseases and infections. In 2020 the medical field faced a new challenge in the form of COVID-19, and researchers are using diffusion and influence maximization to study it from multiple perspectives. Hung et al. (2020) presented the feed behavior of people based on their discussions about COVID-19. Similarly, diffusion has been used in the control of other pandemics, such as influenza A H1N1 (Sandhu et al., 2016), and in agent-based modeling for influenza (Khalil et al., 2012).

Modern health facilities are making the lives of doctors and patients easier with the help of technology. One such health system is discussed by Swan (2009), who highlighted health social networks and diffusion in the form of self-tracking, physician Q&A, and information sharing for emotional support.

Active Farmers and Agricultural Reforms

A farmer knowledge exchange analysis was performed to highlight the farmers who grew their networks more quickly (Wood et al., 2014). The authors analyzed a network of farmers and scientists and concluded that farmers with dense connections have more diffusion power and grew their networks more quickly than those with weak ties. They also found that the diffusion of knowledge and experience sharing depends on the person rather than on a role.

SNA technique is also introduced in the agricultural advisory system to provide information propagation, relationship analysis of participants, and performance of the advisory system (Bourne et al., 2017). (Ramirez, 2013) has also shown positive relationships in participation in organizations and the adoption of technology in agriculture.


3 STUDY CONTEXT AND DATASET DESCRIPTION

Before starting the actual analysis, it is necessary to know the data, the context in which the data was collected, and the description of all available attributes in detail. To analyze human social behavior, the interaction data of the Copenhagen Networks Study (Sapiezynski et al., 2019), which is available on the Scientific Data platform, was used.

Previously available data can rarely be used directly. To make the data useful for this study, some cleansing and transformation is needed. This transformation requires a thorough knowledge of the whole dataset, so that only useless data is removed and the chance of losing useful and interesting information stays small.

The following subchapters explain the context in which the data was collected and describe the data. They also explain the data cleansing and transformation needed and the techniques used for this purpose. Subchapter 3.1 describes the overall context, participants, environment, sources, and ethics of data publication. Subchapter 3.2 explains the characteristics of all attributes available in all types of interaction data.

3.1 Study Context

Sapiezynski et al. (2019) collected interaction data from students through different sources, using their smartphones, data collection software, Bluetooth, cell phone towers, and cell phone location. The study was conducted with the informed consent and approval of all participants. Data was collected for four consecutive weeks. With the hardware devices and software available, the authors collected data for four types of interactions: calls, SMS, Facebook friendships, and proximity interactions.

The published data has four types of interactions. The first type is based on the calls of students. The second type contains SMS interactions. The third type is collected from the connectivity of Bluetooth devices, using the participants' smartphones. Finally, the list of Facebook friendships of the participants is the last type of interaction data provided in this study. Gender data is also provided as a node characteristic. Importantly, the data was published with all ethical aspects considered, and the publishers complied with the EU General Data Protection Regulation (GDPR). They published the data after anonymization and removal of the students' personal information (Sapiezynski et al., 2019).

3.2 Dataset Description

This subchapter explains all attributes provided in the CNS data. Section 3.2.1 describes the calls data, section 3.2.2 describes the SMS data, and section 3.2.3 gives an overview of Facebook interactions. Finally, section 3.2.4 describes the proximity data attributes. In this way, this subchapter summarizes the attributes of all four networks.

3.2.1 Calls Network

The Calls network has a total of 3600 interactions and 4 attributes: timestamp, calling student, call recipient, and duration. The data was collected from the call logs of the participants' smartphones. The calling student is the id of the student who initiates the call, and the call recipient is the id of the student who receives it. Duration is -1 for a missed call. Missed calls are kept in the data because they still represent an interaction between two persons who know each other, since the caller has the contact details of the callee. In short, the calls data records which student called which other student, at what time, and for how long (Sapiezynski et al., 2019). The next section describes the attributes of the SMS data.

3.2.2 SMS Network

The SMS network has 24333 interactions with three attributes: timestamp, sender, and recipient. The data was collected from the daily SMS logs of the students' smartphones. The network is built by taking the sender as the source and the recipient as the target of each edge. The SMS network records that at some timestamp x student ‘A’ sent a message to student ‘B’ (Sapiezynski et al., 2019). The next section describes the Facebook friendship data.

3.2.3 Facebook Network

The Facebook network has 6429 unique friendship entries with two attributes, user_a and user_b, meaning that user_a is a friend of user_b and vice versa. Both attributes hold the ids of participants. The data was collected using the Facebook API. Note that these friendships existed before the start of the study and did not end when the study ended. So, this data records that student ‘A’ is a friend of student ‘B’ on Facebook (Sapiezynski et al., 2019). The next section explains the Bluetooth interaction data.

3.2.4 Proximity Network

Proximity interaction data is more complex than the other networks because of some extra attributes. Proximity interactions represent Bluetooth connectivity between the students' smartphones. Four attributes are captured: timestamp, the id of the source student, the id of the target student, and relative signal strength. By its nature, the number of interactions is very large compared with the other networks. A person who is not part of the study may be captured because Bluetooth was enabled on his or her device (Sapiezynski et al., 2019); such records are noise in the dataset. The large number of possible pairings between discoverable devices is another reason for the size of the dataset.

This data was collected using the participants' Bluetooth devices, which can be discovered at a range of up to 10 m. There are 5474289 unique entries, covering all interactions and empty scans, even those with low signal strength. The four columns are timestamp, user_a, user_b, and RSSI. Timestamps are aggregated into 5-minute intervals called time bins. RSSI is the received signal strength indicator, which shows how far away the discovered device is or, in simple words, how strong the Bluetooth connection is. user_a is the id of the student whose device is scanning for other devices, and user_b is the id of the student whose device is discovered by the device of user_a. Empty scans have user_b = -1 and RSSI = 0. Non-experiment scans are marked with user_b = -2 (Sapiezynski et al., 2019).
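The marker values described above make it straightforward to separate real participant-to-participant contacts from empty and non-experiment scans. The following sketch illustrates one way to do that. It is not the cleansing utility used in this study: the function name `clean_proximity`, the row layout, and the cut-off of -80 for RSSI are assumptions made only for the example.

```python
def clean_proximity(rows, min_rssi=-80):
    """Keep participant-to-participant scans with sufficient signal strength.

    `rows` is an iterable of (timestamp, user_a, user_b, rssi) tuples.
    Rows with user_b = -1 (empty scan) or user_b = -2 (non-experiment device)
    are dropped, as are rows whose RSSI is below `min_rssi` (a hypothetical
    cut-off). Each contact is stored once per time bin with the smaller id
    first, so the direction of discovery is ignored.
    """
    contacts = set()
    for timestamp, user_a, user_b, rssi in rows:
        if user_b < 0:
            continue
        if rssi < min_rssi:
            continue
        contacts.add((timestamp, min(user_a, user_b), max(user_a, user_b)))
    return sorted(contacts)


# Hypothetical rows in the column order timestamp, user_a, user_b, RSSI.
sample_rows = [
    (0, 1, 2, -60),    # valid contact
    (0, 2, 1, -65),    # same pair seen from the other device
    (0, 3, -1, 0),     # empty scan
    (300, 1, -2, -70), # non-experiment device
    (300, 4, 1, -95),  # too weak under the assumed cut-off
    (300, 1, 3, -80),  # exactly at the cut-off, kept
]
cleaned = clean_proximity(sample_rows)
```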


4 METHODOLOGY

This chapter describes the data analysis, data cleansing methods, final measures, and tools used in this study.

Section 4.1 identifies the directed and undirected networks. Section 4.2 gives an overview of the data cleansing techniques that make the data usable for further analysis. Section 4.3 describes the methods and measures used for network analysis, including network-level measures and the generative models used to identify generative factors. Section 4.4 explains the node-level analysis along with the heuristics and centrality measures used to identify influential students in the networks. It also covers the clustering method and the statistical methods used to compare clusters and confirm their accuracy.

4.1 Types of Networks

Social networks are graphs made up of nodes and edges. As with graphs, there are two basic types of networks: directed and undirected. Before starting the analysis, the network under study should be classified as directed or undirected. Section 4.1.1 explains and identifies the directed networks, and section 4.1.2 highlights the undirected networks, with arguments and references from the literature.

4.1.1 Directed Networks

Directed networks are networks in which edges have a direction from one node to another. In such a network, the interaction A→B is not equal to B→A. An example of a directed network is shown in Figure 2.

Figure 2. Directed networks

Calls and SMS are directed networks: if a sender calls or messages a recipient, it does not follow that the recipient also calls or messages the sender. Here direction matters for distinguishing the interactions between two participants. Previous research also shows that networks formed from call or SMS interactions are directed (Zignani et al., 2015).

4.1.2 Undirected Networks

Undirected networks are networks in which edges have no direction.

In this type of network, an edge between A and B represents the same relationship as an edge between B and A. An example of an undirected network is shown in Figure 3.

Figure 3. Undirected networks

The Facebook network is undirected because it represents an edge list of Facebook friendships. For example, if A is a Facebook friend of B, then B is also a friend of A. The data description states that the list contains declared Facebook friendships, not who sent a request to whom (Sapiezynski et al., 2019), so it is considered an undirected network. Catanese et al. (2012) have also treated the Facebook friendship network as an undirected network.

The proximity network is also considered undirected, because the authors of the CNS interaction data explicitly state:

“The information of directionality (whether userA discovered userB or vice versa) is discarded.” (Sapiezynski et al., 2019)

By its nature, the proximity data records users discovering each other within a time bin, together with an RSSI value. It therefore cannot be assumed that user ‘a’ is always the source (discovering) and user ‘b’ is always the target (discovered). This is another reason to treat it as an undirected network.
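The difference between the two network types can be illustrated with a minimal Python sketch. The user IDs and interaction lists below are hypothetical and are not taken from the dataset; the helper names are also assumptions made for illustration only.

```python
# Hypothetical toy interactions; the IDs are made up for illustration.
calls = [("A", "B"), ("B", "A"), ("A", "C")]        # (caller, callee)
friendships = [("A", "B"), ("B", "A"), ("A", "C")]  # declared friendship pairs

# Directed network: order matters, so (A, B) and (B, A) remain distinct edges.
directed_edges = set(calls)

# Undirected network: order is discarded, so (A, B) and (B, A) collapse into one edge.
undirected_edges = {frozenset(pair) for pair in friendships}


def has_directed_edge(edges, source, target):
    """True only if an edge points from source to target."""
    return (source, target) in edges


def has_undirected_edge(edges, u, v):
    """True if u and v are connected, regardless of order."""
    return frozenset((u, v)) in edges


print(len(directed_edges))    # 3
print(len(undirected_edges))  # 2
print(has_directed_edge(directed_edges, "C", "A"))      # False
print(has_undirected_edge(undirected_edges, "C", "A"))  # True
```

The same three input pairs yield three directed edges but only two undirected ones, which is why the classification has to be settled before any measure is computed.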

4.2 Data Analysis and Cleansing

The dataset contains four types of networks. For social network analysis, the only required fields are the source and the target, which together represent one interaction. The network data is simple and needs no cleansing, with the exception of the proximity data, owing to its nature. Before explaining the cleansing techniques, it is necessary to understand the properties of Bluetooth and proximity data, and how researchers have dealt with such data in the past.

For this purpose, Section 4.2.1 explains proximity sensors and proximity data, along with the standards used to aggregate the raw data into useful data. Section 4.2.2 explains the thresholding technique used in this study and the Python script developed to apply it automatically. Similarly, Section 4.2.3 explains the aggregation techniques and the friendship criteria used to extract the relevant data, together with the Python utility used for this task. In this way, this subchapter summarizes the data cleansing techniques applied before the analysis.

4.2.1 Bluetooth Interactions and Proximity Data

Proximity data is location-based data that captures when devices are within range of each other. It can be collected using proximity sensors, usually Bluetooth but also Wi-Fi and QR codes.

Attributes such as the distance and the signal strength between two connected sensors play a crucial role in analyzing context and state when making data cleansing decisions (Finnerty et al., 2014).

The problem here is to identify actual interactions and to remove false interactions caused by the automatic connectivity of Bluetooth devices (Sekara & Lehmann, 2014). Such false interactions can arise simply because two devices are near each other: since both are discoverable, they try to connect. Reviewing the best practices reported in the literature is the best way to choose among the available options. Previous research describes many techniques, including both good and bad approaches.

Identifying friendship ties in proximity interaction data is very challenging. Sekara and Lehmann (2014) propose a framework for identifying friendship ties by comparison with online friendship. They aggregate the data into 5-minute time bins and note that it is difficult to establish whether a close link in the data represents an actual friendship.

They also identify face-to-face links in Bluetooth proximity data based on received signal strength, using 5-minute time bins and defining -80 dBm as the RSSI threshold above which a link is considered actual. They again note that wrong links can be removed, but that identifying the actual links is very hard (Sekara & Lehmann, 2014). Similarly, Finnerty et al. (2014) used [-60, -80] for a distance of 1 meter and [-80, -85] for a distance of 1 to 3 meters.
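The combination of 5-minute time bins and a -80 dBm threshold can be sketched as follows. This is only an illustration under stated assumptions, not the utility developed for this study: the record layout `(timestamp, user_a, user_b, rssi)`, the function name, and the choice to keep the strongest reading per pair and bin are all assumptions.

```python
BIN_SECONDS = 5 * 60   # 5-minute time bin, as reported for Sekara & Lehmann (2014)
RSSI_THRESHOLD = -80   # dBm, threshold reported for Sekara & Lehmann (2014)


def face_to_face_links(readings, bin_seconds=BIN_SECONDS, threshold=RSSI_THRESHOLD):
    """Aggregate raw readings into time bins and keep sufficiently strong links.

    readings: iterable of (timestamp_seconds, user_a, user_b, rssi_dbm).
    Returns {(bin_index, frozenset({user_a, user_b})): strongest_rssi}.
    Assumption: a pair is represented by its strongest reading within a bin.
    """
    strongest = {}
    for timestamp, user_a, user_b, rssi in readings:
        # Direction is discarded, so the pair is stored as an unordered set.
        key = (timestamp // bin_seconds, frozenset((user_a, user_b)))
        if key not in strongest or rssi > strongest[key]:
            strongest[key] = rssi
    # Drop pairs whose strongest signal in the bin is still below the threshold.
    return {key: rssi for key, rssi in strongest.items() if rssi >= threshold}


# Hypothetical readings, made up for illustration.
readings = [
    (0, "u1", "u2", -90),
    (120, "u2", "u1", -70),   # same bin and same pair as above
    (200, "u1", "u3", -85),   # bin 0, too weak
    (310, "u1", "u3", -78),   # bin 1, strong enough
]
links = face_to_face_links(readings)
print(len(links))  # 2
```

Using the strongest reading per bin is one possible aggregation rule; an average or a minimum count of readings per bin would be equally plausible alternatives.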

Distance thresholds based on RSSI for identifying actual links and interactions in proximity data are also defined elsewhere in the literature. Atzmueller and Hilgenberg (2013) chose RSSI >= -65, Do et al. (2013) chose -30 to -32 within one meeting room, and Subhan et al. (2011) took -30 as the lower limit. One more threshold is defined by Stopczynski and Lehmann (2018), who used RSSI >= -75 dBm. Theirs is closely related work involving the same author, and they used short-range data to form a social network from proximity data. They state that when RSSI is greater than or equal to -75 dBm, the distance between two objects is less than or equal to one meter.
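To show how much the choice of threshold matters, the sketch below applies three of the lower bounds reported above to the same set of readings. The RSSI values are hypothetical and the function name is an assumption; only the threshold values come from the cited works as described in this section.

```python
# RSSI lower bounds (dBm) as reported in this section.
THRESHOLDS = {
    "Sekara & Lehmann (2014)": -80,
    "Stopczynski & Lehmann (2018)": -75,
    "Atzmueller & Hilgenberg (2013)": -65,
}


def links_kept(rssi_values, threshold):
    """Count the readings that would be accepted as actual links."""
    return sum(1 for rssi in rssi_values if rssi >= threshold)


# Hypothetical RSSI readings in dBm, made up for illustration.
sample_rssi = [-92, -81, -79, -75, -70, -64, -50]

kept = {source: links_kept(sample_rssi, t) for source, t in THRESHOLDS.items()}
for source, count in kept.items():
    print(f"{source}: {count} of {len(sample_rssi)} readings kept")
```

On this toy sample, the stricter the threshold, the fewer readings survive, so the resulting proximity network becomes sparser as the assumed interaction distance shrinks.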

Identifying actual interactions depends not only on distance and RSSI; the aggregation of interactions over time also matters. In this context, the main range used in
