Data Analysis and Cleansing - Identification of Influence Maximizers in Students’ Social Networks

The dataset contains four types of networks. For social network analysis, the only fields required are the source and the target, which together represent one interaction. The network data is simple and needs no cleansing, with the exception of the proximity data, whose nature makes cleansing necessary. Before the cleansing techniques are explained, it is necessary to understand the properties of Bluetooth and of proximity data, and how researchers have dealt with such data in the past.

For this purpose, section 4.2.1 explains proximity sensors and proximity data, along with the standards used to aggregate raw data into useful data. Section 4.2.2 explains the thresholding technique used in this study and the Python utility developed to apply it automatically. Similarly, section 4.2.3 explains the aggregation techniques and the friendship criteria used to extract the relevant data, together with the Python utility used for this task. In this way, this subchapter summarizes the data cleansing techniques applied before the analysis.

4.2.1 Bluetooth Interactions and Proximity Data

Proximity data is location-based data that records which devices are within range of each other. It can be collected using proximity sensors, usually Bluetooth but also Wi-Fi and QR codes.

Attributes such as the distance and the signal strength between two connected sensors play a crucial role in analyzing context and state, and therefore in making data cleansing decisions (Finnerty, et al., 2014).

The problem is to identify actual interactions and to remove false interactions caused by the automatic connectivity of Bluetooth devices (Sekara, & Lehmann, 2014). Such false interactions can occur simply because two devices with Bluetooth enabled are within range of each other and try to connect. Reviewing the best practices reported in the literature is the most reliable way to choose among the available options. Previous research has described many techniques, both good and bad.

Identifying friendship ties in proximity interaction data is very challenging. A framework for identifying friendship ties was proposed by (Sekara, & Lehmann, 2014), based on a comparison with online friendships. They used a 5-minute time bin for aggregation and noted that it is difficult to determine whether a close link in the data represents an actual friendship.

They also identified face-to-face links in Bluetooth proximity data from the received signal strength, using a 5-minute time bin and defining -80 dBm as the RSSI threshold for an actual link. They again noted that, although wrong links can be removed, it is very hard to identify actual links (Sekara, & Lehmann, 2014). Similarly, (Finnerty, et al., 2014) used [-60, -80] for a distance of 1 meter and [-80, -85] for a distance of 1 to 3 meters.

RSSI-based distance thresholds for identifying actual links and interactions in proximity data are also defined in the literature: (Atzmueller, & Hilgenberg, 2013) chose >= -65, (Do, et al., 2013) chose -30 to -32 within one meeting room, and (Subhan, et al., 2011) took -30 as the lower limit. One more threshold is defined by (Stopczynski, & Lehmann, 2018), who used RSSI >= -75 dBm. It is very similar work by the same author, in which short-range data is used to form a social network from proximity data. They state that if the RSSI is greater than or equal to -75 dBm, then the distance between two objects is less than or equal to one meter.
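As an illustration of how these thresholds differ in practice, the following minimal sketch applies three of the cited limits to a single RSSI reading. The names `THRESHOLDS_DBM`, `is_link` and `links_by_study` are hypothetical and chosen only for this example. Treating every value as a lower limit (a reading counts as a link when it is greater than or equal to the threshold) is an assumption for the -80 dBm value, since only the -65 and -75 thresholds are stated with an explicit direction.

```python
# Lower RSSI limits (in dBm) taken from the studies cited above.
# Assumption: a reading counts as a link when it is >= the limit.
THRESHOLDS_DBM = {
    "Sekara & Lehmann (2014)": -80,
    "Stopczynski & Lehmann (2018)": -75,
    "Atzmueller & Hilgenberg (2013)": -65,
}


def is_link(rssi_dbm, threshold_dbm):
    """Return True when the signal is at least as strong as the threshold."""
    return rssi_dbm >= threshold_dbm


def links_by_study(rssi_dbm):
    """Evaluate one RSSI reading against every threshold in the table."""
    return {study: is_link(rssi_dbm, limit)
            for study, limit in THRESHOLDS_DBM.items()}


if __name__ == "__main__":
    # Hypothetical reading of -78 dBm, evaluated under each threshold.
    for study, accepted in links_by_study(-78).items():
        print(f"{study}: {'link' if accepted else 'no link'}")
```

A reading that is accepted under one threshold can be rejected under another, which is why the choice of threshold has to be justified from comparable studies.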

The identification of actual interactions depends not only on distance and RSSI; the aggregation of interactions over time also matters. In this context, the time bin used in the literature varies and is found to be between 5 and 15 minutes (Sekara, & Lehmann, 2014; Stopczynski, et al., 2013).

A detailed study provides very good results for the aggregation of data in different time bins. (Do, & Gatica-Perez, 2013) found that the probability of real interaction increases with a 10-minute time bin, and the graphs in their research show that it remains similar or higher in the case of OR for 15-minute aggregation. Beyond that, however, real interactions start to be missed. They performed this study using mobile phone proximity data. (Smieszek, et al., 2016) also showed a better probability for time bins ranging from 5 to 15 minutes.
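To make the notion of a time bin concrete, the sketch below assigns timestamps to bins of a chosen width and collects the pairs observed in each bin. It assumes timestamps are expressed in seconds; the function names and the sample scans are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def bin_index(timestamp_s, bin_minutes):
    """Return the index of the time bin that contains the timestamp."""
    return timestamp_s // (bin_minutes * 60)


def pairs_per_bin(records, bin_minutes):
    """Group (timestamp, a, b) records into bins; each bin holds a set of pairs."""
    bins = defaultdict(set)
    for timestamp_s, a, b in records:
        bins[bin_index(timestamp_s, bin_minutes)].add((a, b))
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical scans: (timestamp in seconds, device a, device b).
    scans = [(0, 1, 2), (300, 1, 2), (600, 1, 2), (900, 3, 4)]
    for minutes in (5, 10, 15):
        print(minutes, pairs_per_bin(scans, minutes))
```

Wider bins merge more scans into a single observation, which is the trade-off the cited studies evaluate when comparing 5, 10 and 15 minutes.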

It is very difficult to distinguish real friendships in proximity data. In a professional environment there is usually a 15-minute break for socializing, which suggests a criterion for defining actual ties and interactions (Sarkar, et al., 2016). (Avramidis, et al., 2018) used 15-minute observations to evaluate friendships in a classroom environment. Similarly, in the field of medicine, a close link for the spread of the coronavirus is defined as continued interaction for 15 minutes (Qureshi, et al., 2020).

An actual interaction means a close link between two people. A close link is also defined in a recent study on the coronavirus as follows:

“close contacts, defined as people who had spent 15 minutes or more in face-to-face contact with the infected individual” (Qureshi, et al., 2020)

All the articles reviewed give an overview of proximity threshold values over different ranges. The results also indicate the range of time bins that best identifies actual interactions, so aggregating data within those limits should give better results. In addition, identifying a friendship or an actual tie requires continuous interaction for 15 minutes.

4.2.2 RSSI and Thresholding

The methodologies and standards reviewed above for thresholding on distance and signal strength suggest that the most commonly used values lie between -65 and -85 dBm. The most relevant value is the one used by (Stopczynski, & Lehmann, 2018) in similar work, where it is defined as -75 dBm for capturing interactions within a distance of 1 meter. Of all the thresholds used in the literature for different distance ranges, it is the most useful and relevant. For data cleansing, an RSSI threshold of -75 dBm is therefore used.

For data cleansing, a Python utility was written that removes all rows with an RSSI value below -75 (Stopczynski, & Lehmann, 2018), as well as empty and non-experimental scans, which have use_b set to -1 and -2 respectively. Empty scans also have an RSSI value of 0, as mentioned in the data description section. The utility writes the required output file with a reduced number of interactions (2426279), along with the statistics below.

Total Records: 5474289

Positive Records: 2418900

Negative Records: 3055389

import csv

with open('Input.csv', 'r') as fin, open('Output.csv', 'w', newline='') as fout:
    # define reader and writer objects
    reader = csv.reader(fin, skipinitialspace=True)
    writer = csv.writer(fout, delimiter=',')

    # write headers
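Because the listing above is truncated after the header comment, the following is a minimal sketch of how the filtering step described in this section could look; it is not the original utility. The column names `rssi` and `use_b` as CSV headers, the function names, and the representation of the excluded values as strings are all assumptions made for illustration.

```python
import csv

RSSI_THRESHOLD_DBM = -75       # threshold adopted in this section
EXCLUDED_PEERS = {"-1", "-2"}  # empty (-1) and non-experimental (-2) scans


def keep_row(row, rssi_col="rssi", peer_col="use_b"):
    """Return True for rows that are real scans and pass the RSSI threshold."""
    if row[peer_col].strip() in EXCLUDED_PEERS:
        # Empty scans carry RSSI 0, which would otherwise pass the threshold.
        return False
    return float(row[rssi_col]) >= RSSI_THRESHOLD_DBM


def filter_file(src_path, dst_path):
    """Copy rows that pass keep_row; return (total, positive, negative) counts."""
    total = positive = 0
    with open(src_path, newline="") as fin, \
            open(dst_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, skipinitialspace=True)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            total += 1
            if keep_row(row):
                writer.writerow(row)
                positive += 1
    return total, positive, total - positive
```

The three counts returned correspond to the kind of statistics reported above, where the positive and negative records sum to the total (2418900 + 3055389 = 5474289).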

4.2.3 Time Aggregation and Friendship Criteria

Aggregating the data into time bins is necessary in order to retain only meaningful interactions. Previous methods have used bin sizes of 5 to 15 minutes, as explained in section 4.2.1 (Sekara, & Lehmann, 2014; Avramidis, et al., 2018). For this study, a 15-minute time bin was chosen after reviewing the literature (Do, & Gatica-Perez, 2013). A larger time bin may cause actual interactions to be removed. A Python utility was developed to aggregate the data for a given time bin and window size. The window size to be passed to the utility is calculated as

Window size = (minutes to aggregate*60) - 300

The reason for subtracting 300 is that the first window is represented by 0. Conversely, the number of minutes aggregated can be calculated from the window size using the formula below

Minutes Aggregated = (Window size / 60) + 5
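The two formulas are inverses of each other, which the following short sketch checks for a few values. The function names and the constant `SCAN_INTERVAL_S` are hypothetical; reading the 300 seconds as one 5-minute bin is an interpretation of the formulas above.

```python
SCAN_INTERVAL_S = 300  # assumed: one 5-minute bin, in seconds


def window_size(minutes_to_aggregate):
    """Window size = (minutes to aggregate * 60) - 300."""
    return minutes_to_aggregate * 60 - SCAN_INTERVAL_S


def minutes_aggregated(window_size_s):
    """Minutes aggregated = (window size / 60) + 5."""
    return window_size_s / 60 + 5


if __name__ == "__main__":
    for minutes in (5, 15, 60):
        w_s = window_size(minutes)
        print(minutes, w_s, minutes_aggregated(w_s))
```

For the 15-minute aggregation used in this study, the window size is (15 * 60) - 300 = 600, and (600 / 60) + 5 gives back 15 minutes.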

The main criterion for friendship identification is continuous interaction between participants, as used in previous research (Sarkar, et al., 2016), (Avramidis, et al., 2018), and (Qureshi, et al., 2020). To identify continuous interaction, the utility below counts the interactions across matching rows in three consecutive bins, since 15-minute aggregation is used.

The count is represented by the weight. In the aggregated data, only interactions with a weight of 3 represent continuous interaction over the last 15 minutes. This is similar to a greedy approach, which chooses the maximum of all available options.

A Python script similar to the thresholding script given in section 4.2.2 is used to remove aggregated interactions with a weight of less than 3. In simple terms, those pairs were not interacting continuously over the last three 5-minute time bins, that is, the last 15 minutes.

After aggregation and the removal of interactions based on the friendship criteria, the Bluetooth data contains 64220 interactions, all of good quality.

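import pandas as pd

############# Rules ##############
# 5 + (w_s / 60) = how many minutes are aggregated
# minutes --- (minutes * 60) - 300
# 1 hour  --- (60 * 60) - 300 should be the window size
# General --- (minutes to aggregate * 60) - 300

# Assumption: the original listing does not show how the data is loaded;
# the file name here is a placeholder.
data = pd.read_csv("Input.csv")

# Window to aggregate
w_s = 600          # as per the formula given above, this aggregates 15 minutes of data
bin = w_s // 300   # number of additional 5-minute bins in each window

# Assumption: the loop and the initial values are missing from the original
# listing and are reconstructed here.
frames = []
i = 0
while i * 300 <= data['timestamp'].max():
    # select the rows whose timestamp falls inside the current window
    df1 = data[data['timestamp'] <= (i + bin) * 300]
    df_split = df1[df1['timestamp'] >= i * 300]
    # sum the weights of identical (source, target) pairs within the window
    df_group = (df_split.groupby(['source', 'target'])['weight'].sum().to_frame(name='weight').reset_index())
    df_group.insert(0, 'timestamp', int(i * 300))
    frames.append(df_group)  # DataFrame.append was removed in pandas 2.0
    i = i + 1 + bin

df_cum = pd.concat(frames, ignore_index=True)
df_cum.to_csv("Output.csv", encoding='utf-8', index=False)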
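The combined effect of the aggregation and the friendship criterion can be illustrated on a toy example using only the standard library. This is a simplified sketch under stated assumptions, not the utility used in the study: timestamps are assumed to be multiples of 300 seconds, each input row is assumed to carry a weight of 1 with at most one row per pair and bin, and all names and sample rows are hypothetical.

```python
from collections import Counter

BIN_S = 300          # width of one base bin in seconds (5 minutes)
BINS_PER_WINDOW = 3  # three bins make up one 15-minute window


def aggregate(rows):
    """Sum weights per (window start, source, target) over 15-minute windows."""
    window_s = BIN_S * BINS_PER_WINDOW
    totals = Counter()
    for timestamp, source, target, weight in rows:
        window_start = (timestamp // window_s) * window_s
        totals[(window_start, source, target)] += weight
    return totals


def continuous_only(totals, required_weight=BINS_PER_WINDOW):
    """Drop aggregated interactions whose weight is below the required count."""
    return {key: w for key, w in totals.items() if w >= required_weight}


if __name__ == "__main__":
    # Hypothetical rows: (timestamp, source, target, weight).
    rows = [
        (0, 1, 2, 1), (300, 1, 2, 1), (600, 1, 2, 1),  # seen in all three bins
        (0, 3, 4, 1), (600, 3, 4, 1),                  # missing the middle bin
        (900, 1, 2, 1),                                # next window, one bin only
    ]
    print(continuous_only(aggregate(rows)))
```

Only the pair observed in every 5-minute bin of a window reaches a weight of 3 and survives the filter, which corresponds to the 15 minutes of continuous interaction required by the friendship criterion.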
