
Fairness in Variational Autoencoders Recommenders

Faculty of Information Technology and Communication Sciences (ITC)
Master's thesis
April 2019


Waleed Imtiaz: Fairness in Variational Autoencoders Recommenders Master’s thesis

Tampere University

Master's Degree Programme in Computational Big Data Analytics, April 2019

Recommender systems have become increasingly relevant since major services such as Amazon, Netflix, and many other web platforms started enhancing the user experience with recommendations. However, implementing state-of-the-art systems that also produce fair recommendations remains a pressing need.

Variational autoencoders (VAE) have proven to be very effective at modeling user preferences in the collaborative filtering setting. Recently, a recurrent version of the VAE, based on recurrent neural networks and known as the Sequential Variational Autoencoder (SVAE), has been proposed; unlike the simple VAE, it captures temporal dependencies. However, the scores that the SVAE predicts for the top-ranked items in a recommendation list are often almost identical, which increases unfair treatment in the long term.

This thesis presents an overview of advanced recommender systems, covering both historical and state-of-the-art methods used to carry out recommendation tasks efficiently. Furthermore, we describe unfairness as an obstacle to better recommendations and discuss ways to enhance fairness by incorporating random noise during the evaluation phase of the SVAE. This randomness provides long-term fairness and makes the model fairness-aware, at the cost of a small, tolerable loss in ranking quality as measured by Normalized Discounted Cumulative Gain (NDCG). Finally, we compare the fairness gained against the loss in NDCG for different noise distributions. It turns out that the effect on ranking quality is small compared to the fairness achieved.

Keywords: Recommender Systems, Collaborative Filtering, Fairness in ranking, Sequential Variational Autoencoders, Recurrent Networks, Sequence modeling.


Contents

1 Introduction
2 Related Work
2.1 Recommender Systems
2.1.1 Collaborative filtering
2.1.2 Content based filtering
2.1.3 Hybrid filtering
2.1.4 Fairness in recommender systems
2.2 Neural Networks
2.2.1 AutoEncoders
3 Methods
3.1 Variational AutoEncoders
3.2 Sequential Model
3.3 Amortized Fairness
3.4 Fairness aware Sequential Variational AutoEncoder
4 Evaluation
4.1 Data
4.2 Evaluation Metric
4.3 Experiments
4.3.1 Results
5 Conclusions
References
APPENDIX A
APPENDIX B


1 Introduction

This chapter is the starting point of the thesis and sets out the key concepts and vocabulary for the rest of the work. It contains a general introduction and motivation regarding recommender systems, followed by the key concepts, which include collaborative filtering, fairness in recommender systems, and advanced methods such as Sequential Variational Autoencoders (SVAEs). The last part of this chapter concludes with an overall description of the presented solution and the challenges faced.

Nowadays, along with all the technological developments and their positive impact on society, there are some challenges as well. Advancements in communication, digital interactions, and the internet have exposed people to large-scale information and digital content. Digital communication channels such as Instagram and Snapchat, and movie and music services such as Spotify and Netflix, are just a few examples introduced in recent decades. The one thing that ties all these services together is that they possess and share huge amounts of data.

Because content is added to the internet at an enormous speed and volume, it is a hassle for users to find the right information in this data jungle. Recommender systems identify users' needs and use that knowledge to provide users with the right content (Burke 2002). Due to extensive growth in information technology, new systems have been developed that can perform advanced analysis and recommend personalized content for each user. This approach is used in very diverse scenarios. Modern recommender systems are customer-centric in the sense that their analysis is based on the user's history. After the user's content is processed, it is associated with other content that shares the same features or characteristics, as discussed by Tkalcic et al. (2011). Users with the same or overlapping interests are grouped together, and content consumed by one user in a group is recommended to the other users in that group. As a result, for a given user, recommendations are based on content used by other users with the same interests. These approaches are known as content-based and collaborative filtering methods. A renowned example of a system that uses these methods is the Netflix movie recommender, which displays a list of movies and TV series the user may like based on what other users and groups have watched or rated before (Narang et al. 2010).

Collaborative filtering is one of the most used techniques in the recommender systems space. It works by creating a user profile that contains the items ranked by that user; this ranking is then compared with the wider audience. Similarities between different users are drawn based on the historical ratings they have given. Ratings can be defined differently for different collaborative recommender systems, and the rating definition makes all the difference. A common example is song recommendation on Spotify; other well-known application cases are Netflix movies and Amazon products, as discussed by Hu et al. (2008).

Collaborative filtering for recommendations is a very active research area. Latent factor models (Hu et al. 2008; Gopalan et al. 2015) dominate the literature, mainly because of their easy implementation and effective results. They do have performance issues in terms of scalability, as they are linear, which restricts the capacity of the model. Prior research has shown that adding non-linear factors to linear models improves recommender system performance significantly.

Recently, collaborative filtering (CF) approaches using neural network architectures have been introduced. In a recent survey, Zhang et al. (2019) discuss the use of autoencoders in recommender systems. Autoencoders can be used to learn a hidden representation, or to fill in the interaction matrix in the reconstruction layer of the autoencoder. CF solutions based on autoencoders normally take the ratings given by each user as input; these ratings are sparse. The input is encoded into a latent space, and predictions for unrated items are obtained in the decoding phase, where the input is reconstructed. AutoRec (Sedhain et al. 2015) is a good example of collaborative filtering using autoencoders.

Variational autoencoders have been proposed as a modern and advanced method for CF-based recommendation. They provide a multinomial generative model that is useful for finding a non-linear latent-variable space. VAEs outperform prior neural-network-based methods by estimating the parameters of a normal distribution in the middle layer, which enhances the representation of the rating data.

Looking at recent trends and technological improvements, the current literature has been focusing on deep learning based on advanced neural methods. This makes sense, since deep learning has a clear advantage over traditional approaches in terms of recommendation accuracy. One example that incorporates neural models is Neural Collaborative Filtering (NCF) by He, Liao, H. Zhang, et al. (2017). NCF generalizes matrix factorization as a latent model: it takes the user-item matrix and models preferences with a simple multilayer perceptron operating on the latent space.

The majority of deep learning techniques used for CF build on the intuition behind autoencoders. AutoRec (Sedhain et al. 2015) passes a user's preference history to an encoder; the decoder reconstructs the input, and unseen preferences can be predicted from the decoder output. It shapes the data so that it includes scores for every item, even if an item was not rated by the user. Autoencoders also tackle the cold start problem: they can consider side information, which lessens the data sparsity.

Hybrid approaches that combine deep learning and latent modeling have also received attention recently. Collaborative Deep Learning (Wang et al. 2015) combines Bayesian matrix factorization with a stacked denoising autoencoder (Vincent et al. 2010). Another example that learns a latent space of item representations is AutoSVD++ by S. Zhang, Yao, et al. (2017).

Since the variational autoencoder framework (Kingma and Welling 2014; Rezende et al. 2014) was introduced into recommender systems, latent models and deep learning models have become closely integrated. Both Collaborative VAEs (CVAE) (Li et al. 2017) and Hybrid VAEs (HVAE) (Gupta et al. 2018) feed user information into a variational autoencoder: the encoder takes that input and creates a representation of items in latent space. In CVAE, the rating matrix is modeled by combining user-item embeddings with the latent item representation, whereas HVAE relies purely on the variational autoencoder to reconstruct the whole user history. In contrast, the neural generative model proposed by Liang et al. (2018) maps the user's history to a latent space using a multinomial likelihood. In the collaborative filtering space, much research has been done on modeling temporal dependencies by incorporating the user's history with respect to time (Quadrana, Cremonesi, et al. 2018). The Factorizing Personalized Markov Chain model (FPMC) by Rendle, Freudenthaler, et al. (2010) is one example, which combines Markov chains and matrix factorization: FPMC models transition probabilities between items using user and item embeddings.

Fairness has emerged as an important factor in recommender systems. Recommendations are based on data, and data is collected from different sources; while working with data-driven systems, we must consider the bias that comes with it. For example, survey data may contain biased questions, and the algorithm or the person collecting the data can introduce their own biases as well. This can impact the final results. One important aspect of recommendation is fairness in rankings. Normally, recommended items are presented in a list ordered by predicted score. Users give more attention to items in the first positions of that list, and as the position decreases, so does the user's attention. This results in unfair treatment, especially when the estimated scores are very close to each other. There should be mechanisms for exposing high-scoring items fairly while still arranging them in order.

When discussing advanced technology such as deep learning in state-of-the-art recommender systems, its superiority over traditional approaches in efficiency and effectiveness is clearly evident. However, there are downsides and challenges when it comes to implementing such methods. The biggest challenge is the complexity of the models: training is very slow, and it is expensive to upgrade the system so that the model can be trained to give satisfactory results. Model training is itself a complicated step, in which one needs to fine-tune the model parameters again and again. In short, there are many performance issues when implementing these models for recommendations. They need a huge amount of data to train, and with less data one cannot train the model well; we used two data sets of 20 million and 100 million movie ratings. The explainability of the model is also a challenge. As for fairness, there are few existing implementations of fairness in recommender systems based on sequential autoencoders; this is our first attempt to make a sequential model fairness-aware.

In this master's thesis, we combine deep learning and latent variable modeling to carry out sequential recommendation while enhancing fairness. The sequential modeling task is inspired by Sachdeva et al. (2019). It is assumed that the latent factor modeling user preference influences the item at a given timestamp; however, the user history can also influence the latent factor, which can be modeled to capture long-term preferences. Many recent studies demonstrate the capabilities of recurrent neural networks, using a sequential variational autoencoder to capture temporal dependencies in a collaborative filtering context (Sachdeva et al. 2019). The main architecture of the sequential autoencoder is the same as that of VAEs, except that it handles sequential data. We can vary the parameters of the inner layer within the range defined by their variance, so for the same input we can obtain different plausible outputs. Once the sequential model is built, the next task is to enhance fairness. For the fairness approach, we took motivation from Borges and Stefanidis (2019): a stochastic component is added in the test phase of the model. We observe that the added noise changes the output scores for the same input. In our case, the experimental results show that the unfairness measure is reduced, although the quality of the movie rankings is also affected; that impact is minimal compared to the fairness gained. We also noticed that the larger the incorporated noise, the more it affects the predicted scores and the ranking order.
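To make the basic idea concrete, the sketch below perturbs a vector of predicted item scores with zero-mean Gaussian noise before ranking, so that nearly tied items can swap positions between requests. The array values, the noise scale, and the function name are illustrative assumptions, not the exact configuration used in the experiments later in the thesis.

```python
import numpy as np

def rank_with_noise(scores, noise_std=0.1, top_k=10, rng=None):
    """Perturb predicted item scores with Gaussian noise and return a top-k ranking.

    scores    : 1-D array of predicted relevance scores, one per item
    noise_std : standard deviation of the zero-mean Gaussian noise
    top_k     : number of items to recommend
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = scores + rng.normal(0.0, noise_std, size=scores.shape)
    # Items with nearly identical scores can now swap positions between calls,
    # which spreads user attention over them in the long run.
    return np.argsort(-noisy)[:top_k]

# Example: three items with almost identical scores and one clearly worse item.
scores = np.array([0.91, 0.90, 0.89, 0.40])
print(rank_with_noise(scores, noise_std=0.05, top_k=3))
```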

The contributions of this master's thesis are summarized as follows.

• We extended the framework of sequential recommendation, in which user preferences exhibit temporal dependencies, by adding a noise factor in the modeling phase.

• Furthermore, we discussed the adoption of the fairness problem in a collaborative filtering scenario using sequential modeling.

• To the best of our knowledge, this is the first time a stochastic noise variable has been incorporated into sequential modeling to enhance its fairness.

• The unfairness decreases as the variance of the normally distributed stochastic component increases. On the other hand, this also impacts the ranking quality, but the impact is minimal.

(10)

2 Related Work

This chapter starts with the origin of the idea of recommender systems, followed by a discussion of the different methods used to carry out recommendation tasks and their drawbacks. After that, fairness is discussed as an essential aspect of recommendation, along with ways that can lead us to fair recommendations. In the end, we discuss in detail modern recommendation techniques involving statistical and neural methods used for efficient recommendations.

2.1 Recommender Systems

The history of recommender systems goes back to the early 1990s, when the first recommender system was developed by Goldberg et al. (1992). That system was built to handle the heavy load of emails arriving in users' inboxes. It was a filtering algorithm that allowed users to collaborate by registering their reactions to documents after reading them.

In the last decade, as the world has moved towards digitization, the demand for recommender systems has increased tremendously. Previously, people usually bought products based on feedback from relatives, friends, or the stores themselves, which took time and effort. Now everything is digital, and recommender systems provide a selection of personalized recommendations, services, and content. This has resulted in a rapid increase in smart recommender systems. Recently, many techniques have been introduced and researched, both by the scientific community and by product companies, to provide a better experience to people, which also allows companies to increase their profit (Adomavicius and Tuzhilin 2005).

To explain the concept further, consider a single-user recommender system with a set of users U and a set of items I to be rated. Items may be rated by a user, represented as rating(u, i) in the range [0.0, 5.0], where item i ∈ I and user u ∈ U. We denote the set of all ratings given by users in the system by R. The users that have rated an item i and the items that have been rated by a user u are represented as U(i) and I(u), respectively. So far so good: users rate items, so we know the preferences of specific users for the specific items they have rated. The question is what to do about the items a user has not rated, because in the real world a user cannot rate everything. This is where recommendations come in. Recommender systems estimate a relevance score relevance(u, i) for the items i ∈ I unrated by the user u ∈ U.


Researchers have developed many approaches for building recommender systems that estimate the relevance score of each item not yet rated by the user. The best-known techniques are collaborative filtering, content-based filtering, and hybrid filtering (Acilar and Arslan 2009; Chen et al. 2008; Jalali et al. 2010).

Collaborative filtering: CF takes into account similar users and their past ratings. It predicts items on the basis of those users' rating behavior.

Content-based filtering: The user gets recommendations based on the similarity between item content and their own profile and preferences.

Hybrid filtering: HF combines different recommendation techniques and gives predictions.

In the next sections, we focus more extensively on the most common techniques mentioned above.

2.1.1 Collaborative filtering

Collaborative filtering is the backbone of many recommender systems. For tasks such as recommending music, books, or movies, CF is the underlying method. It was first used to build a system in the Tapestry project (Goldberg et al. 1992). CF relies on the proposition that users who agreed in their evaluation of specific items in the past will agree again in the future. It analyzes the profiles of different users, which contain their interests and preferences. Recommendations are then derived by calculating a similarity matrix between user profiles and matching users with similar profiles (Herlocker et al. 2004).

CF builds a user-item matrix that contains the items ranked by all users according to their preferences. It compares the ratings of all users and predicts the ratings a user has not given by comparing the rankings of similar users. The final prediction is based on the idea that other users' preferences can be picked up and combined in a way that provides suitable predicted ratings matching the user's own preferences. The system takes feedback, which can be either explicit or implicit, from the user and predicts new items by deducing similarities (Isinkaye et al. 2015). There are different approaches for defining the similarity matrix and predicting a rating that has not been given by the user. The commonly used CF approaches are user-based and item-based (Bobadilla et al. 2013; Adomavicius and Tuzhilin 2005).


user-based: Identify the neighbors, i.e., the most similar users, and then predict an item's rating based on what the neighbors have rated.

item-based: Identify the most similar items and treat them as neighbors; the predicted rating is based on the ratings of the neighboring items.

The first method, user-based collaborative filtering, also known as k-nearest-neighbor collaborative filtering, predicts ratings for user usr by creating a pool of similar users, known as neighbors, and then using the neighbors' ratings to compute the final ranking (Ekstrand, J. T. Riedl, Konstan, et al. 2011). The predicted rating for user usr and item i is computed as a weighted average of the ratings given by the neighbors, with the similarities as weights:

\hat{rt}_{usr,i} = \bar{rt}_{usr} + \frac{\sum_{v \in S^k(usr)} sim(usr, v)\,(rt_{v,i} - \bar{rt}_{v})}{\sum_{v \in S^k(usr)} sim(usr, v)} \qquad (2.1)

The second method, item-based CF, predicts the rating rt by computing the similarity between the item i that is unrated by the user and the group of items already rated by user usr (Celma 2010). A similarity measure, such as cosine similarity, conditional probability, or Pearson correlation, is computed to find the similarity between item i and item j (Ekstrand, J. T. Riedl, Konstan, et al. 2011):

sim(i, j) = \frac{\sum_{usr \in U} (rt_{usr,i} - \bar{rt}_{usr})\,(rt_{usr,j} - \bar{rt}_{usr})}{\sqrt{\sum_{usr \in U} (rt_{usr,i} - \bar{rt}_{usr})^2}\;\sqrt{\sum_{usr \in U} (rt_{usr,j} - \bar{rt}_{usr})^2}} \qquad (2.2)

Collaborative filtering can be further divided into the following groups:

Memory-based CF takes the whole user-item matrix into account and makes predictions on that basis. It creates a group of users who share the same preferences; once the neighbors of each user are identified, the rating of a new item for a specific user is predicted by aggregating the ratings given to it by the neighbors. Both item- and user-based CF belong to this group (Adomavicius and Tuzhilin 2005).

Model-based CF uses advanced analytical methods such as machine learning and data mining. It trains an ML model on the ratings already given by the users; the model learns the rating patterns and predicts the ratings of new items based on the data it was trained on (Adomavicius and Tuzhilin 2005).
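As an illustration of the memory-based, user-based prediction in Equation 2.1, the sketch below estimates a missing rating from a tiny in-memory rating matrix. The toy data, the use of Pearson correlation as the similarity, and the absolute-value normalization in the denominator are assumptions made for this example only.

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows are users, columns are items.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def pearson_sim(a, b):
    """Pearson correlation over the items both users have rated."""
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:
        return 0.0
    ad, bd = a[mask] - a[mask].mean(), b[mask] - b[mask].mean()
    denom = np.sqrt((ad ** 2).sum() * (bd ** 2).sum())
    return float(ad @ bd / denom) if denom > 0 else 0.0

def predict_user_based(R, usr, item):
    """Weighted-average prediction in the spirit of Equation 2.1."""
    mean_u = R[usr][R[usr] > 0].mean()
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == usr or R[v, item] == 0:
            continue
        s = pearson_sim(R[usr], R[v])
        mean_v = R[v][R[v] > 0].mean()
        num += s * (R[v, item] - mean_v)
        den += abs(s)
    return mean_u if den == 0 else mean_u + num / den

print(round(predict_user_based(R, usr=0, item=2), 2))
```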

As with all techniques, collaborative filtering comes with advantages and drawbacks. One of its main advantages is that it does not depend on domain knowledge, which means it can easily recommend complicated items such as movies without any understanding of the movie industry or domain. On the other hand, scalability, cold start, and sparsity are some of its underlying issues. The amount of data is crucial for recommendations: a considerable amount of information is required to make correct recommendations, which gives rise to cold start problems (Lika et al. 2014). Recommendation is also a computationally expensive task, as it requires a massive amount of data. Finally, the most common and frequently occurring problem is that while the number of users is plentiful, the number of rated items is minimal; we have noticed that sometimes even popular items have few ratings, which makes the task even more difficult.

Matrix factorization is an extended version of the collaborative filtering approach used in recommender systems. This method works by splitting the user-item matrix into the product of two lower-dimensional rectangular matrices (Koren et al. 2009). Methods of this kind have become a success within collaborative filtering recommender systems due to their accuracy and scalability when producing recommendations. Items and users are both described by latent factor vectors, which are deduced from the item ratings given by users. Matrix factorization involves a training phase in which the model parameters are learned; this training is done offline. In memory-based CF, by contrast, there is no training phase, so the prediction phase takes more time. Many extended versions of matrix factorization have been presented, and it has become part of most recent recommender systems.
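The snippet below is a minimal sketch of matrix factorization trained by stochastic gradient descent on the observed ratings only. The latent dimensionality, learning rate, regularization strength, and toy matrix are arbitrary illustrative choices.

```python
import numpy as np

def factorize(R, k=2, lr=0.01, reg=0.02, epochs=200, seed=0):
    """Factor R (0 = unrated) into user factors P and item factors Q so that R ≈ P @ Q.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    observed = np.argwhere(R > 0)
    for _ in range(epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            # Gradient step on the squared error with L2 regularization.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

R = np.array([[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [0, 0, 5, 4]], dtype=float)
P, Q = factorize(R)
print(np.round(P @ Q.T, 2))  # predicted ratings, including the previously unrated cells
```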

2.1.2 Content based filtering

The quality of recommendations can be improved by using item metadata, as it gives detailed knowledge about both the user and the items being ranked. For this reason, Content-Based Filtering (CBF) was introduced as a filtering method for recommendations. It works by examining the items that the user has liked historically or is currently considering (Zhou et al. 2012). Items similar to the positively rated items are suggested to the user. It is essentially a keyword-specific method, where keywords are used to describe the items. In the movie domain, for example, keywords such as director, genre, and actors can be used to find relations between similar movies (Zhou et al. 2012). A vital trait of content-based systems is the ability to model the user's interests with a function that learns their distribution (Pazzani and Billsus 2007). Given a group of users and items to be ranked, the function predicts the probability that a specific user will like an item. These predictions can take various forms: a probabilistic approach, which estimates the probability that the user is interested in an item, or a numeric approach, in which the algorithm computes the relevance of an item to the user. Many machine learning algorithms, such as decision trees, nearest-neighbor methods, and naive Bayes, can be used to predict the relevance of items in content-based systems. The following steps are involved in a content-based system:

• The first step is to create a user profile u.

In cases where there is no user profile, a user profile is generated based on their historical liked and disliked items.

• A classification algorithm is applied to the data.

The result of the classification algorithm is the predicted score of an item i such that i ∉ I(usr).

We can also use the similarity function as an alternative.

• Top-k items are selected for recommendations.

Items are ranked with respect to the predicted score given by the algorithm, and the k items with the highest scores are chosen.

The selection of models varies with the use case. If we want to predict something meaningful, for example in text recommendation, we first need to convert the text into numbers that the algorithm can understand. To find similarity between textual documents, CBF uses different methods, including term frequency-inverse document frequency (TF-IDF), Naive Bayes (Friedman, Geiger, and Goldszmidt 1997), decision trees (Duda, Hart, and Stork 2012), and neural networks (Bishop 2006). These techniques use statistical analysis or machine learning models to learn a representation of the data and output the required recommendations. Content-based methods do not depend on the profiles of other users, as those do not impact the recommendations. The technique also generalizes well: frequent changes in a user's profile over time do not cause issues, so it can adjust to changing user behavior.
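As a concrete example of the TF-IDF route, the sketch below scores unseen items by the cosine similarity between their textual descriptions and the descriptions of items the user liked. The movie descriptions are invented for illustration, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented item descriptions (metadata) used only for this example.
items = {
    "Movie A": "space opera adventure with starships and aliens",
    "Movie B": "romantic comedy set in Paris",
    "Movie C": "science fiction thriller about artificial intelligence and space",
}
liked = ["Movie A"]  # items the user rated positively

names = list(items.keys())
tfidf = TfidfVectorizer().fit_transform(items.values())          # item x term matrix
# Simple user profile: mean TF-IDF vector of the liked items.
profile = np.asarray(tfidf[[names.index(n) for n in liked]].mean(axis=0))
scores = cosine_similarity(profile, tfidf).ravel()

# Rank the remaining items by similarity to the profile.
for name, score in sorted(zip(names, scores), key=lambda x: -x[1]):
    if name not in liked:
        print(name, round(float(score), 3))
```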

The content-based technique overcomes some of the downsides of collaborative techniques. It can suggest new items even when a user has not given any ratings, and if there is no knowledge of the user's interests in the database, it can still predict ratings. Privacy is preserved, as a user does not need to share their profile in order to receive recommendations (Lam, Frankowski, and J. Riedl 2006); the content-based method only depends on the item's metadata. The major downside of this technique is that it needs in-depth knowledge and descriptions of the properties of the target items, which is also known as limited content analysis. The efficiency of CBF is therefore dependent on the availability of descriptive data. Content over-specialization (T. Zhang and Iyengar 2002) is another drawback of the CBF technique: users only get recommendations similar to items already described in their profiles. For example, recommending articles based on a user's news-browsing behavior is good, but it would be much more insightful if different services, such as music, movies, or videos, could also be recommended based on that behavior. As content-based algorithms need a lot of training data to learn user behavior from the user profile, they can suffer from cold start problems as well. The knowledge sources in these systems can be divided into three different groups (Wietreck 2018).

Catalog knowledge: In-depth knowledge about the item that is being recommended. This catalog is dependent on the domain, which can be quite complex. This information is in the form of features, attributes, labels, or tags (Burke 2002).

Functional knowledge: It is all about matching the user’s needs with a specific item, which includes the type of algorithms or methodologies during the matching process (Burke 2002).

User knowledge: To make suitable recommendations, a model based on the user is developed, which includes all the detailed data of the user, e.g., demographic or preference information (Wietreck 2018).

2.1.3 Hybrid filtering

Hybrid filtering addresses the limitations of the above-mentioned techniques by combining different methods to perform recommendations (Adomavicius and J. Zhang 2012; Stern, Herbrich, and Graepel 2009). The idea is that combining different techniques results in more accurate, useful, and generalized recommendations, since the shortcomings of one technique can be overcome by another (Schafer et al. 2007). The combination can be accomplished in different ways: the algorithms can be implemented separately and their results combined at the end, or content-based and collaborative methods can be used together, depending on the use case, and vice versa.

Burke (2002) has defined several methods which combine different filtering techniques:

Weighted hybrid recommender: It combines the results of all recommendation techniques to approximate the predicted score for an item. The benefit of the weighted hybrid is that the capacity and abilities of the system stay relevant during the recommendation process.

Switching hybrid recommender: This method pre-defines certain criteria for switching between different recommendation methods. For example, a system built as a combination of collaborative and content-based filtering can switch to collaborative filtering if content-based filtering fails or its results do not meet a specific criterion. This technique needs criteria or triggers for switching between techniques.

Knowledge based: A knowledge-based recommender system uses explicit knowledge about the items, the user's preferences, and the recommendation criteria, taking the context into account. Recommendations are based not on the user's rating history but on specific queries made by the user. Interest in knowledge-based systems has emerged recently, and they have received much attention. Recommendations made by these systems also take domain knowledge into account, rather than only users and items.

Mixed hybrid recommender: presents recommendations from multiple techniques together. For example, in some cases the recommendations can be mixed together into the final predicted list. One of the benefits of a mixed approach is that it minimizes the cold start problem for items to be ranked, since content-based filtering can be used for items that have not been rated.

Feature combination: This method merges content-based and collaborative filtering approaches. The collaborative data is used as an extra feature, and content-based techniques are applied over this combined data set. Feature combination lets the system benefit from collaborative data without depending heavily on it, reducing its sensitivity.

Cascade hybrid recommender: This hybrid approach first produces a coarse rating prediction, and a second method then refines the recommendations predicted for a user. This allows the second method to work with already filtered data, which makes the task efficient, effective, and fast.

Feature augmentation: This method works sequentially. It starts by applying the first technique to produce a rating or classification for an item i and then passes the result to the second method; the output of the first technique is used as an input feature for the second technique.

Several studies by Burke (2017) and Burke (2002) have shown that hybrid recommenders are more effective than pure techniques such as collaborative and content-based filtering. These methods are used to control some of the common issues in recommender systems, such as cold start. Netflix is a very good example of a system that has incorporated hybrid techniques for recommendations: it combines collaborative filtering, by examining the search and viewing habits of similar users, with content-based filtering, by suggesting movies that share features with movies the user has rated highly.

2.1.4 Fairness in recommender systems

In recommender systems, we must recognize the critical role of personalization. The core of recommendation is that the items best suited for one user may not be the best for other users. It is also essential to note that recommender systems exist to facilitate transactions. Thus, many recommendation applications involve multiple stakeholders and may therefore give rise to fairness issues for more than one group of participants (Abdollahpouri, Burke, and Mobasher 2017). Recent research in recommender systems has identified diversity as an essential part of recommendation quality. However, algorithms trained to increase diversity among recommended items can result in biased treatment of users. The concept of user fairness has largely been ignored so far: satisfying each user by recommending items according to their interests, or, in a group setting, ensuring that every recommended package contains something for every user in the group.

In order to suggest items to the user, recommender systems learn from prior user interactions and the choices users have previously made, such as giving ratings (for movies, products, etc.). The effectiveness of such an algorithm usually depends on the accuracy of the items predicted for the user, which can be evaluated with respect to the user's interest in the predicted items (Leonhardt et al. 2018). The importance of user fairness becomes clear when we consider the effect of recommendations on users. It can be unfair to disregard the preferences of a specific set of users while trying to improve diversity in recommendations; at the same time, it is equally undesirable if there is no element of novelty in the items recommended to the user (Leonhardt et al. 2018). A more drastic situation can be observed on a job site, where user behavior can change due to additional features: a less confident user may click on lower-salary jobs and will then be presented only with lower-salary jobs, even if they are qualified for higher-salary ones.

We consider two major sources of unfair distributions in recommender systems. The first, and more obvious, is the constantly changing distribution caused by recommending items to users, which results in unfairness where some items are rarely recommended. The second is caused by algorithms that address unfairness in the marketplace. Much of the prior research relates to enhancing personal recommendations by improving individual diversity, where the goal is to give distinguishable recommendations to end users, and to aggregate diversity, which focuses on enhancing item diversity across all users. Individual and aggregate diversity can thus be used to enhance fairness for users and items, respectively: individual diversity focuses on providing each user with a diverse set of recommendations, while aggregate diversity focuses on item diversity overall. In creating fairness, one needs to examine fairness factors so as not to unfairly discriminate against a specific group of users (Leonhardt et al. 2018). There is ample literature in machine learning on fairness and discrimination in data mining and information retrieval systems. Fairness is implemented in different forms that relate to different goals, such as securing fair results for diverse groups, equal accuracy, or equal false positive/negative rates.

Basic Fairness Definitions

There are different definitions of fairness, depending on how we interpret the satisfaction of user u with a recommended item i (Sacharidis 2019). Under fairness proportionality, one can assume that user u is happy with recommended item i if user u sees item i among the top-ranked suggested items.

According to this definition, user u finds the recommended package fair if there are at least some items in the package that the user likes. Alternatively, when an item i rated highly by user u appears among the top recommendations, the user is satisfied with the item; this is known as envy-freeness in the group. In short, the user is satisfied with the recommended package if it contains m items that the user likes. These definitions are given by Herreiner and Puppe (2009) and Serbos et al. (2017). In advanced statistical and modeling algorithms, the quality of fair results is normally tied to minimizing bias against a single user or a group of users. Fairness through awareness (Dwork et al. 2012) assures indifference to sensitive characteristics (e.g., age, race, gender) as a plan for balancing results.

Hardt et al. (2016) recommend that the true positive rates should be equal across all groups, which results in equal opportunity. When items are ranked in a particular order, the proportion of individuals from a given group appearing in a prefix of the ranking must exceed a given percentage to satisfy a statistical test of representativeness, as described by Zehlike et al. (2017). Items in different ranking positions receive different levels of attention: lower-ranked items receive far less attention than higher-ranked ones. One solution to balance the opportunity for items to receive attention is to re-rank them after the scores are computed (Singh and Joachims 2018).

This minimizes unfairness measures. The issue we handle here is that, in the recommended list, the relevance scores of the items are almost the same. In these situations, an important decision has to be made about which items to place at the top. One potential solution is Amortized Fairness, proposed by Biega, Gummadi, et al. (2018): across a series of rankings, the accumulated relevance of an item should be proportional to the accumulated attention it receives.
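Below is a minimal sketch of an amortized unfairness measure in the spirit of Biega, Gummadi, et al. (2018): attention accumulated over a series of rankings is compared with accumulated relevance, both normalized. The position-based attention model (geometric decay) and the toy numbers are assumptions, not the exact formulation used by the cited authors or later in the thesis.

```python
import numpy as np

def position_attention(n_items, p=0.5):
    """Attention received at each rank position (geometric decay from top to bottom)."""
    return p ** np.arange(n_items)

def amortized_unfairness(rankings, relevance):
    """L1 distance between normalized accumulated attention and accumulated relevance.

    rankings  : list of arrays; each array lists item indices from top to bottom for one round
    relevance : array with the relevance of each item
    """
    n = len(relevance)
    attention = np.zeros(n)
    att = position_attention(n)
    for ranking in rankings:
        attention[ranking] += att            # item at position k receives attention att[k]
    acc_rel = relevance * len(rankings)      # relevance accumulated over the same rounds
    a = attention / attention.sum()
    r = acc_rel / acc_rel.sum()
    return float(np.abs(a - r).sum())

relevance = np.array([0.91, 0.90, 0.89])
static = [np.array([0, 1, 2])] * 3                                   # same order every round
rotated = [np.array([0, 1, 2]), np.array([1, 2, 0]), np.array([2, 0, 1])]  # rotate the top spot
print(round(amortized_unfairness(static, relevance), 3),
      round(amortized_unfairness(rotated, relevance), 3))
```

With nearly tied relevance scores, rotating the top position spreads attention more evenly and yields a lower unfairness value than repeating the same static ranking.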

Fairness in Collaborative filtering

Fairness in computing has been discussed in relation to equality and justice for ages. In machine learning, fairness is discussed as an absence of discrimination. Almost every recommender system depends on data, and it has been observed that data-driven systems can produce statistical unfairness or bias. This can be due to several reasons, mainly bias in the training data, users' reactions to certain groups of items (which creates behavioral user bias), and bias already present in the data (Altenburger et al. 2017).

The most dominant recommendation technique, collaborative filtering (Wang et al. 2015), takes user behavior as input and ignores user demographics and item attributes. However, this does not mean that fairness can be ignored; consider, for example, a recommender system suggesting job opportunities to job seekers. The ultimate goal of such a system should be to ensure that male and female users with similar qualifications get recommendations of jobs with the same rank and salary. The system would need to guard against biases in the recommendation output, even if the biases arise entirely from behavioral differences: for example, male users might be more inclined to click on high-paying jobs. Tackling such biases is difficult if we cannot establish a shared global ranking over items.

Fair Ranking

A fair ranking has a certain set of characteristics. These include individual fairness, which means that similar items are treated consistently, and an adequate presence of items belonging to different groups, such as disadvantaged or protected groups, limiting harm to the members of those groups. In fair ranking, two main lines of work have been proposed and widely discussed: attention-based and probability-based measures (Castillo 2019). Attention-based measures quantify the attention that different items receive through different sources, such as click-through rates, or the consideration they might get through proxies such as the probability that items will be considered relevant (and probably clicked). The second group, probability-based measures, assumes that a random process generates the ranking; one such process is proposed by Yang and Stoyanovich (2017). This statistical process involves comparing the distributions of a wide set of groups (using KL-divergence) and then averaging their differences. The difference is computed in a discounted way, similar to Normalized Discounted Cumulative Gain (NDCG), which is widely used in information retrieval (Järvelin and Kekäläinen 2002).
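Since NDCG is also the ranking-quality measure used later in this thesis, a short sketch of NDCG@k is given below. It uses the simple rel_i / log2(i + 1) discount variant, and the binary relevance labels are illustrative.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance values, in the given order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(i + 1) for positions i = 1..k
    return float((rel / discounts).sum())

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by the DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of items in the order they were recommended (1 = relevant, 0 = not relevant).
print(round(ndcg_at_k([1, 0, 1, 1, 0], k=5), 3))
```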

Multi-sided Fairness

Fairness can have a multi-sided purpose in different recommendation settings, especially in cases where we want fair results when multiple individuals are involved (Burke 2017). These stakeholders are divided into different categories: consumers C, providers P, and the platform or system S. Recommendations are delivered to consumers; they are the main end users. The providers stand behind the scenes and anticipate consumer demand, supplying what is needed; in the case of recommendations, the providers offer the recommended items. Finally, the platform that has built the recommender system bridges the gap between consumers and providers and, as a result, gains the financial benefit. Recommendation processes within multi-sided platforms can give rise to questions of multi-sided fairness. There are three classes of systems, characterized by the fairness issues that arise relative to these groups: consumers (C-fairness), providers (P-fairness), and both (CP-fairness).

Fairness in Groups

Group recommendation has motivated substantial research efforts because of its potential to benefit a group of users simultaneously. The idea is to try to increase the satisfaction of every member of the group while minimizing the unfairness between members. For example, for a given group of users, we would like to recommend a certain package P to the group members. We examine two distinct perspectives on fairness. One is that each user finds what they are looking for in the package, meaning there are sufficiently many items in the package that they like compared to items they do not; this is called fairness proportionality. The second perspective is envy-freeness, which means that for every member of the group there are enough items in the package that they like more than the other members do (Serbos et al. 2017).

We have seen the rise of multiple approaches concerning the adoption of fairness in recommender systems. Fair ranking concerns the predicted rating an item receives based on user behavior. Diversity focuses on variety, both from the user and the item point of view. Attention-based measures, on the other hand, quantify the attention a user gives to an item, which indicates the user's interest. While evaluating different approaches, we must have a good understanding of the problem. All these approaches help us understand not only the causes of unfairness but also pave the way to unbiased recommendations.

2.2 Neural Networks

The idea of a neural network was presented by N. Rashevsky in 1938 (Rashevsky et al. 1938). Neural networks are often described as a representation of our brain, or rather of the neurons that constitute the brain. The design was further developed by McCulloch and Pitts in 1943 (McCulloch and Pitts 1943). The brain is made of millions of neurons, which can be triggered in different situations depending on the stimulation received. Neurons form a connected graph: when an action is triggered by one neuron, it in turn triggers the stimulation of other neurons, which results in, for example, walking.

The central element of a neural network is the perceptron, which behaves similarly to a neuron in the brain. The role of the perceptron is to accept input through connections (like synapses in the brain), process it, and produce a relevant output, as in Figure 2.1. Each input connection carries a weight, denoted W_{i,j}. Each input x is multiplied by its respective weight to form the perceptron's input. The cumulative input of a perceptron is the sum of all input values multiplied by the weights attached to their connections, as in Equation 2.3. The value of the perceptron is obtained by applying an activation function to this cumulative weighted sum, as in Equation 2.4. Different activation functions can be used depending on the need; commonly used functions include the sigmoid, hyperbolic tangent, and ReLU. They differ in their ranges: the sigmoid ranges between 0 and 1, while the hyperbolic tangent ranges from -1 to 1. The sigmoid is commonly used as it is nonlinear. The perceptron produces the output in Equation 2.4, which is then passed on through the connections in the network.

x_j = \sum_{i \to j \,\in\, \text{connections}} W_{i,j}\, x_i \qquad (2.3)

a_j = \sigma(x_j) = \frac{1}{1 + e^{-x_j}} \qquad (2.4)
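A minimal NumPy sketch of Equations 2.3 and 2.4: a single sigmoid perceptron computing its weighted input sum and activation. The input and weight values (and the optional bias term, which the equations above omit) are arbitrary example choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def perceptron(inputs, weights, bias=0.0):
    """Equation 2.3: weighted sum of the inputs; Equation 2.4: sigmoid activation."""
    x_j = np.dot(weights, inputs) + bias
    return sigmoid(x_j)

inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.8, 0.2, -0.5])
print(round(perceptron(inputs, weights), 4))
```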

A fully connected neural network has three layers. The first layer handles each input through its nodes. The second layer, the middle or hidden layer, also contains a number of nodes. Finally, the last layer is the output layer, which produces the output of the neural network.

Figure 2.1 ANN

In the generalized form of a fully connected feed-forward artificial neural network, every neuron in layer l is connected to the outputs of all the neurons in the previous layer and carries its own inputs, weights, and biases. This architecture is also known as a Multilayer Perceptron (MLP) (Stegemann and Buenfeld 1999).

Figure 2.2 MultiLayer perceptron

Feed Forward Network: The first neural networks developed were feed-forward. The term feed-forward conveys that information flows in the forward direction only: once data enters through the input nodes, it moves forward to a hidden layer, where computations are performed on the input using an activation function, and the result is then presented in the last layer, the output layer. The data flow is only in one direction, forward, even if multiple hidden layers are included. This means there is no feedback loop in the network, so it is difficult for the model to learn the distribution when there is no path for feedback, as shown in Figure 2.2.

Back Propagation Learning: A neural network is a machine learning technique that learns patterns from the data given to it during training. The procedure commonly followed to train an ANN is to give feedback after every prediction so that the model can learn from its mistakes and correct them. This core training process is called back-propagation. In back-propagation, the weights are updated after every iteration, depending on the error obtained during the previous iteration. This weight tuning helps the network learn better and generalize by reducing the error rate, which results in a reliable trained model. In the training phase, the training data (input) is passed through the nodes of the network, as described in the previous section, and specific output values are obtained for the given input. The resulting output is compared with the desired output from the data set, and an error is computed as in Equation 2.5, which tells how far the output value is from the original value. This error is then back-propagated through the network, which updates the weights, and the process is repeated until the error is minimized, as in Equation 2.6. The main knowledge carriers in an ANN are the weights: the whole idea is to find the right weights so that the model can learn the given data distribution and produce generalized output. Back-propagation uses a gradient descent method to find the right weights; it tries to find the minimum of the mean squared error metric (Mitchell 1997).

\delta_e = o_e (1 - o_e)(t_e - o_e) \qquad (2.5)

\delta_i = o_i (1 - o_i) \sum_{e \,\in\, \text{outputs}} W_{ei}\, \delta_e \qquad (2.6)

In Equations 2.5 and 2.6, the error of the e-th output node is \delta_e. It is calculated from the network's predicted output value for that node, o_e, and the desired output value, t_e. The error of a hidden node is calculated by taking the sum of all the weights W_{ei} multiplied by the errors \delta_e of the connected output nodes, and then multiplying that sum by o_i(1 - o_i), where o_i is the output of the i-th hidden node.
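The sketch below performs one back-propagation step for a tiny one-hidden-layer sigmoid network, using the error terms of Equations 2.5 and 2.6. The network shape, learning rate, and training pair are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden weights
W_ho = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output weights
x = np.array([0.2, 0.7])                    # training input
t = np.array([1.0])                         # desired output
lr = 0.5                                    # learning rate

# Forward pass
o_h = sigmoid(x @ W_ih)                     # hidden activations
o_e = sigmoid(o_h @ W_ho)                   # output activations

# Equation 2.5: output-node error delta_e = o_e (1 - o_e)(t_e - o_e)
delta_e = o_e * (1 - o_e) * (t - o_e)
# Equation 2.6: hidden-node error delta_i = o_i (1 - o_i) * sum_e W_ei delta_e
delta_h = o_h * (1 - o_h) * (W_ho @ delta_e)

# Weight updates proportional to the node errors and the incoming activations.
W_ho += lr * np.outer(o_h, delta_e)
W_ih += lr * np.outer(x, delta_h)
print(sigmoid(sigmoid(x @ W_ih) @ W_ho))    # output after one update, now closer to t
```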

2.2.1 AutoEncoders

An autoencoder (AE) (Rumelhart et al. 1985) is a type of feed-forward ANN that aims to learn a representation of the input data and to reconstruct the input from it in the decoding phase. Typically, the goal is to discover important structure in the data.

The AE consists of two essential components, the encoder and the decoder. The task of the encoder is to take the data as input and map it into a latent space:

z = f(x) = s_f(W x + b_z) \qquad (2.7)

where s_f is the activation function, W \in \mathbb{R}^{d_z \times d_x} is the weight matrix, and b_z \in \mathbb{R}^{d_z} is the bias.

After the encoding phase, the AE decodes the data by reconstructing the original input representation with minimal loss (Tan et al. 2015).

\hat{x} = g(z) = g(f(x)) = s_g(W z + b_x) \qquad (2.8)

where s_g is an activation function, and the weight matrix and bias are the decoder's counterparts of those explained above.

The autoencoder training procedure finds the set of parameters \theta = (W, b_z, b_x) that minimizes the loss function L(i, g(f(i))). This measures the quality of the reconstruction, i.e., how close the recreated output is to the input. The mean squared error (MSE) is used to compute the reconstruction loss:

Loss = \|i - \hat{i}\|_2^2 \qquad (2.9)

The architecture of an AE is the same as that of the neural network described above, with input, middle, and output layers connected as in Figure 2.3. As the model intends to reconstruct the input signal, the output layer has the same size as the input layer. Autoencoders can be forced to learn a compressed representation if their latent code z has a lower dimensionality than the input space x; in that configuration, an autoencoder can be used for dimensionality reduction. An autoencoder with one hidden layer, linear activation functions, and MSE as the loss is essentially equivalent to Principal Component Analysis (PCA) (Murphy 2012).

Figure 2.3 AutoEncoder

Denoising AutoEncoder: A denoising AE (DAE) operates as a standard AE with one exception: the training data fed to the encoding phase is corrupted. Nevertheless, the purpose remains the same; we still want to reconstruct the original, non-corrupted training data in the decoding step. In this case, the extracted features or patterns are more robust to input noise and yield a better high-level representation of the data (Vincent et al. 2010). Several methods are commonly used to corrupt the input data, including Gaussian noise and masking noise. For Gaussian noise, a random value with zero mean and some variance is added to the sample.

Sparse AutoEncoder: The sparse autoencoder is a variation of the normal autoencoder in which the loss function includes a sparsity constraint on the code layer; the training objective is still to minimize the reconstruction loss. In an autoencoder, the encoder's task is to compute the probability P(z|x), while the decoder reconstructs the input, effectively computing P(x|z). In a sparse AE, however, only the decoder is parameterized: the encoder is used only during optimization. Ranzato et al. (2007) proposed common ground between the sparse AE and the normal AE, first applied to computer vision tasks in advanced statistical methods and pattern recognition. The proposal is that the code z can be free, as in a sparse coding algorithm, but an encoder can be added that works in the same way as in a normal autoencoder, together with a penalty term that measures the difference between the encoder's output and the non-parametric code z. This helps achieve two goals: reconstructing the input while staying close to the output of the encoder.
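To close this subsection, here is a minimal PyTorch sketch of the basic autoencoder of Equations 2.7-2.9, with optional Gaussian input corruption as in the denoising variant. The layer sizes, batch shape, and noise level are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder maps x to a lower-dimensional code z; decoder reconstructs x from z (Eqs. 2.7-2.8)."""
    def __init__(self, d_x=784, d_z=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_x, d_z), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(d_z, d_x), nn.Sigmoid())

    def forward(self, x, noise_std=0.0):
        # For a denoising AE, corrupt the input with zero-mean Gaussian noise before encoding.
        if noise_std > 0:
            x = x + noise_std * torch.randn_like(x)
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                      # a batch of toy inputs
x_hat = model(x, noise_std=0.1)
loss = nn.functional.mse_loss(x_hat, x)      # Equation 2.9: reconstruct the clean input
loss.backward()
optimizer.step()
print(float(loss))
```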


3 Methods

This chapter presents modern methods that make decision making in recommender systems more efficient, accurate, and fair. Due to advances in technology and researchers' constant focus on recommender systems, new methods have been adopted that make recommender systems more reliable. Building on autoencoders, Variational Autoencoders (VAE) have been presented as the latest technique for the collaborative filtering recommendation task. One important aspect of recommender systems that is often ignored is fairness. When we do predictive modeling using machine learning and deep learning techniques, it is very important that the algorithm treats all of its objects fairly, especially in the case of recommendations. Once we have the ranked list of predicted items, the top items are shown to the user. The problem is that, at the top of the list, the items in the first places have scores that are very similar to each other. Which item to show first then becomes a big issue, because when the relevance of the listed items is almost the same, all of them have an equal claim to the top positions. To address this issue in our experiments, we use the concept of amortized fairness presented by Biega, Gummadi, et al. (2018), in which the attention gained across rankings should be proportional to the relevance; this paves the way for long-term fairness.

3.1 Variational AutoEncoders

The autoencoders are designed in a way that given input data is transformed into an encoding vector where each dimension designates some learned attribute about the data. The encoder network outputs a single value for each encoding dimension.

The task of the decoder network is to take these values and recreate the original input as closely as possible.

The intuition behind variational autoencoders derives from the goal of latent variable modeling described by Murphy (2012). A variational autoencoder (VAE) presents a probabilistic method for describing an observation in latent space. Whereas in ordinary autoencoders the encoder outputs a single value for each latent feature, in a VAE the task of the encoder is to output a distribution for every latent feature. If we suppose a latent space, denoted by z, with K dimensions, we can define P_\theta(z) for z \in Z as a probability density function, where \theta are the parameters of the density.

The relation between input y and latent encoding vector z can be fully defined by:

Prior: P_\theta(z)

Likelihood: P_\phi(y|z)

Posterior: P_\phi(z|y)

The observed data points Y = y1, .., yn are modeled as Pϕ(yi|z), so that the overall likelihood of Y can be defined by marginalizing over Z:

P(Y) = \prod_i P(y_i) = \prod_i \int P_\phi(y_i|z)\, P_\theta(z)\, dz \qquad (3.1)

In reference to the classical VAE framework proposed by Diederik, Welling, et al. (2014), z is distributed according to a normal distribution, z \sim N(0, I_K).

If we examine Equation 3.1, we can notice that P_\theta(z) is used to parameterize P_\phi(y_i|z), which represents any transformation of z that exhibits such a dependency. A neural network is used for this transformation, and \phi refers to the set of network parameters that need to be updated. Our aim is to maximize P(Y) by finding an optimal parameter set \phi. Variational inference (Blei et al. 2017) handles this by introducing Q(z|y) as an approximation function that estimates the true posterior P(z|y).

Loss Function (ELBO): The estimated Q(z|y) should be very close to the true posterior P(z|y). To quantify the difference between these two distributions we can use the Kullback–Leibler divergence. The KL divergence D_{KL}(Q \| P) measures the information lost when Q is used to represent the distribution P. We want to minimize D_{KL}(Q(z|y) \| P(z|y)). We have the following derivation:

D_{KL}(q_\phi(z|y) \| p_\theta(z|y))
= \int q_\phi(z|y) \log \frac{q_\phi(z|y)}{p_\theta(z|y)}\, dz
= \int q_\phi(z|y) \log \frac{q_\phi(z|y)\, p_\theta(y)}{p_\theta(z,y)}\, dz
= \int q_\phi(z|y) \left( \log p_\theta(y) + \log \frac{q_\phi(z|y)}{p_\theta(z,y)} \right) dz
= \log p_\theta(y) + \int q_\phi(z|y) \log \frac{q_\phi(z|y)}{p_\theta(z,y)}\, dz
= \log p_\theta(y) + \int q_\phi(z|y) \log \frac{q_\phi(z|y)}{p_\theta(y|z)\, p_\theta(z)}\, dz
= \log p_\theta(y) + E_{z \sim q_\phi(z|y)} \left[ \log \frac{q_\phi(z|y)}{p_\theta(z)} - \log p_\theta(y|z) \right]
= \log p_\theta(y) + D_{KL}(q_\phi(z|y) \| p_\theta(z)) - E_{z \sim q_\phi(z|y)} \log p_\theta(y|z)

Finally, we have

D_{KL}(q_\phi(z|y) \| p_\theta(z|y)) = \log p_\theta(y) + D_{KL}(q_\phi(z|y) \| p_\theta(z)) - E_{z \sim q_\phi(z|y)} \log p_\theta(y|z)

After rearranging the left- and right-hand sides, the following equation is obtained by Sachdeva et al. (2019):

\log P(Y) - KL[Q(z|y) \| P(z|y)] = E_{z \sim Q(z|y)}[\log P(y|z)] - KL[Q(z|y) \| P(z)] \qquad (3.2)

In Equation 3.2, the left-hand side contains the term to maximize when learning the true distributions and the error made by approximating the true posterior P(z|y) with Q(z|y). The right-hand side of the equation is the variational lower bound, or Evidence Lower Bound (ELBO), with the addition of Q used for optimization. We model Q(z|y) = N(\mu_\lambda(y), \Sigma_\lambda(y)), so that the term KL[Q(z|y) \| P(z)] has a closed form. We model the parameters \mu_\lambda(y) and \Sigma_\lambda(y) through a neural network trained on y and parameterized by \lambda.

The model comprises two neural networks: an encoder and a decoder. The encoder encodes the input y into the latent variable z through Q_\lambda(z|y), while the decoder's task is to decode the latent code z back into the corresponding y through P_\phi(y|z). The right-hand side of Equation 3.2 gives the loss function computed on the training data y.

Figure 3.1 Variational AutoEncoder (Borges and Stefanidis 2019)

A problem with this loss is that it is approximated by sampling values of z from Q(z|y). To solve this, Diederik, Welling, et al. (2014) proposed the reparameterization trick: rather than sampling z directly, we sample a noise variable \epsilon according to a fixed distribution P(\epsilon) and obtain z by means of a differentiable transformation depending on \lambda, \epsilon, and y. For sampling we can use Gaussian noise \epsilon \sim N(0, I) and obtain z_\lambda(\epsilon, y_i) = \mu_\lambda(y_i) + \Sigma_\lambda(y_i) \cdot \epsilon, which is normally distributed according to the parameters \mu_\lambda and \Sigma_\lambda. Commonly, the structure of Q(z|y) is a multivariate Gaussian:

z \sim q_\phi(z \mid y^{(i)}) = N(z; \mu^{(i)}, \sigma^{2(i)} I)

After the reparameterization trick:

z = \mu + \sigma \odot \epsilon, \quad \text{where } \epsilon \sim N(0, I)


Given the loss function, we can conclude:

L(\phi, \lambda; Y) = \sum_i \Big\{ \frac{1}{2} \sum_k \big( \sigma_{\lambda,k}(y_i) - 1 - \log \sigma_{\lambda,k}(y_i) + \mu_{\lambda,k}(y_i)^2 \big) - E_{\epsilon \sim N(0,I)} \big[ \log P_\phi(y_i \mid z_\lambda(\epsilon, y_i)) \big] \Big\}
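The sketch below puts the pieces of this section together in PyTorch: a Gaussian encoder, the reparameterization trick, and an ELBO-style loss combining a multinomial (softmax) likelihood with the standard closed-form KL term against a unit-Gaussian prior. The layer sizes, the toy batch, and the use of the conventional KL closed form (written in terms of log-variance) are illustrative assumptions rather than the exact model trained in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_items=1000, d_z=64, d_h=256):
        super().__init__()
        self.enc = nn.Linear(n_items, d_h)
        self.mu = nn.Linear(d_h, d_z)        # mu_lambda(y)
        self.logvar = nn.Linear(d_h, d_z)    # log sigma^2_lambda(y)
        self.dec = nn.Linear(d_z, n_items)   # f_phi(z)

    def forward(self, y):
        h = torch.tanh(self.enc(y))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(logits, y, mu, logvar):
    # Multinomial log-likelihood of the observed interactions under the softmax output.
    recon = -(F.log_softmax(logits, dim=-1) * y).sum(dim=-1)
    # Closed-form KL between N(mu, sigma^2 I) and the standard normal prior N(0, I).
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
    return (recon + kl).mean()

model = VAE()
y = torch.bernoulli(torch.full((8, 1000), 0.05))   # toy implicit-feedback batch
logits, mu, logvar = model(y)
loss = elbo_loss(logits, y, mu, logvar)
loss.backward()
print(float(loss))
```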

3.2 Sequential Model

The framework for the sequential variational autoencoder was proposed by Sachdeva et al. (2019) as an extension of the framework described in the previous section. This extension is based on time-aware user preferences. In a probabilistic framework, temporal dependencies can be modeled by conditioning each single event on its preceding history in the sequence y^{(1:T)}:

P(y^{(1:T)}) = \prod_{t=0}^{T-1} P(y^{(t+1)} \mid y^{(1:t)}) \qquad (3.3)

In the above equation we can see that y^{(t+1)} and y^{(1:t)} have a recurrent relationship, expressed by P(y^{(t+1)} \mid y^{(1:t)}). These temporal dependencies can be handled through a conditional VAE. Consider a simple generative model:

z_u^{(t)} \sim N(0, I_K), \qquad \pi(z_u^{(t)}) \propto \exp\{ f_\phi(z_u^{(t)}) \}, \qquad y_u^{(t)} \sim \mathrm{Multi}(1, \pi(z_u^{(t)})) \qquad (3.4)

The joint likelihood corresponding to Equation 3.4 is

P(y_u^{(1:T)}, z_u^{(1:T)}) = \prod_t P(y_u^{(t)} \mid z_u^{(t)})\, P(z_u^{(t)}) \qquad (3.5)

Using a proposal distribution, we can formulate the approximate posterior with respect to that proposal distribution as follows:

Q_\lambda(z_u^{(1:T)} \mid y^{(1:T)}) = \prod_t q_\lambda(z_u^{(t)} \mid y^{(1:t-1)}) \qquad (3.6)

In Equation 3.6, the Gaussian distribution Q_\lambda(z_u^{(1:T)} \mid y^{(1:T)}) has parameters \mu_\lambda^{(t)} and \sigma_\lambda^{(t)}, which depend on the current history through a recurrent layer. Here we introduce the recurrent layer h_t:

h_t = RNN_\lambda(h_{t-1}, y_u^{(t-1)}) \qquad (3.7)

One can observe the architecture of the sequential model by referring to Equation 3.3. In the sequential model, z_u^{(t)} depends on the recurrent relationship in the layer: we can retrieve information from the history through the dependency of the latent variable on the recurrent layer.
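To make the recurrence in Equations 3.6 and 3.7 concrete, the sketch below shows a GRU-based encoder that produces per-timestep mu and sigma from the interaction history, so that z_u^{(t)} depends on y_u^{(1:t-1)} only through the hidden state h_t. The layer sizes, the GRU choice, and the toy batch are assumptions for illustration, not the exact SVAE of Sachdeva et al. (2019).

```python
import torch
import torch.nn as nn

class SequentialEncoder(nn.Module):
    """Recurrent encoder: h_t = GRU(h_{t-1}, y^{(t-1)}); q_lambda(z^{(t)} | y^{(1:t-1)}) = N(mu_t, sigma_t^2)."""
    def __init__(self, n_items=1000, d_h=128, d_z=64):
        super().__init__()
        self.rnn = nn.GRU(n_items, d_h, batch_first=True)
        self.mu = nn.Linear(d_h, d_z)
        self.logvar = nn.Linear(d_h, d_z)

    def forward(self, y_seq):
        # y_seq: (batch, T, n_items) multi-hot interactions.
        # Shift the sequence right so that z^{(t)} only conditions on y^{(1:t-1)}.
        y_prev = torch.cat([torch.zeros_like(y_seq[:, :1]), y_seq[:, :-1]], dim=1)
        h, _ = self.rnn(y_prev)                                   # h_t for every timestep
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized z^{(t)}
        return z, mu, logvar

enc = SequentialEncoder()
y_seq = torch.bernoulli(torch.full((4, 10, 1000), 0.02))   # 4 users, 10 timesteps
z, mu, logvar = enc(y_seq)
print(z.shape)   # torch.Size([4, 10, 64])
```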
