Food determinants defining self-rated health - Random Forest approach on grocery purchases and sociodemographic profiles on single and 2-adult households

Academic year: 2022



Beata Rantaeskola

FOOD DETERMINANTS DEFINING SELF-RATED HEALTH

RANDOM FOREST APPROACH ON GROCERY PURCHASES AND SOCIODEMOGRAPHIC PROFILES ON SINGLE AND 2-ADULT HOUSEHOLDS

MASTER'S THESIS

FACULTY OF ENGINEERING

AND NATURAL SCIENCES

Mikko Hokka

Jari Viik

4/2021


ABSTRACT

Beata Rantaeskola: Food determinants defining self-rated health – Random Forest approach on grocery purchases and sociodemographic profiles on single and 2-adult households
Master's Thesis

Tampere University

Degree Program in Material Science, MSc (Tech)
April 2021

Eating patterns have been found to have a major effect on health, and diet is typically studied with food diaries, questionnaires, and 24-hour recalls. These methods rarely capture real-life eating habits, because they are usually distorted by conscious and unconscious biases. New ways to study diet are therefore needed. With grocery purchase (GP) data these biases can be minimized, because the data are gathered objectively over a long period of time.

The GP data for this study come from the LoCard project, a cooperative study between Tampere University, the University of Helsinki, and Suomen Osuuskauppa. The LoCard data cover all purchases made with the electronic S-loyalty card between 9/2016 and 12/2019 by the customers who participated in the research. In addition to the GP data, all participants filled in a questionnaire in summer 2018.

Food purchase data from the year 2018 were selected for this study. The demographic variables sex and age were obtained from the customer-owner register; education level, household size, current main activity, and self-rated health were added from the survey. Self-rated health was rated by one household member on a scale of good, quite good, moderate, quite bad, and bad.

The study is performed with a machine learning classification technique called Random Forest (RF). RF is an ensemble technique built on a supervised learning model called the decision tree. An RF consists of many decision trees; at each node the algorithm splits the data into the two groups that separate the self-rated health (SRH) classes most effectively. Each tree votes for a class, and RF combines the votes of all decision trees into a prediction of the most suitable class by majority voting. The method was chosen for its suitability for large datasets, high classification accuracy, ease of use, and efficiency. RF is typically used at the beginning of a study to get an overall picture of the most important and most effective splitting variables in the dataset.

The aim of the study is to find out whether a household's grocery purchases determine self-rated health and, if so, which variables do and how. The ability of the Random Forest method to detect and characterize the association between grocery purchases and self-rated health is examined. The study focuses on one- and two-adult households and also examines differences in purchase habits between two age groups. Only customers with a loyalty degree above 60% in grocery purchases are included.

This thesis is a novel way to study the effects of diet on SRH at the population level. The method uses the variables that split the data into the different SRH classes most effectively. Food variables with high classification importance could indicate a customer's self-rated health and reveal similar purchase-behaviour patterns across SRH states. The study material opens a new way of studying population-level purchases objectively but still in high detail. Purchases are likely to reflect the population's dietary patterns and their association with SRH.

The study showed that grocery purchases are strongly associated with the SRH class. Many variables split the data into different SRH classes by lower or higher purchase amounts. The RF algorithm performs with quite high accuracy on the modified dataset, and its SRH predictions mostly fall into the correct class or its adjacent classes. RF therefore seems to detect the relationship with SRH and to classify a person's SRH from the household's grocery purchases. The age groups showed both similarities and differences in how grocery purchases associate with SRH.

This new methodical approach with Random Forest is able to reveal several types of association: descending and ascending linearity, non-linearity, and step functions with a threshold value. RF thus seems to perform well and to bring a new perspective, since some of these findings would be harder or impossible to reveal with other models. The RF method can open new ways to see how people's grocery purchases relate to their SRH.

This study produces information that can be used to offer healthier products and knowledge to customers and thereby improve the overall health of the whole population. Food retailers can, for example, offer customers alternatives in decision-making situations where unhealthy choices can be replaced with better ones. The research produced new findings, and a scientific article will be prepared on the subject of this thesis.

Keywords: Random Forest, Classification, Health, Machine learning, Decision Tree, Grocery Purchases, Self-rated Health, Supervised learning

The originality of this thesis has been checked with the Turnitin service.


TIIVISTELMÄ

Beata Rantaeskola: Food determinants defining self-rated health – Random Forest approach on grocery purchases and sociodemographic profiles on single and 2-adult households
Master's Thesis

Tampere University

Degree Program in Material Science, MSc (Tech)
April 2021

Diet has been found to have a major effect on health, and it is typically studied with, for example, short-term food diaries and questionnaires. These methods are prone to errors and do not necessarily give a realistic picture of eating behaviour, because their use is usually accompanied by conscious and unconscious recording errors. New ways to study diets are therefore sought. With grocery purchase data these problems can be minimized, as the data are collected objectively over a long period of time.

The purchase data for this study come from the LoCard project, carried out as a cooperative study between Tampere University, the University of Helsinki, and Suomen Osuuskauppa. The LoCard data cover all purchases made with the electronic S-loyalty card between 9/2016 and 12/2019 by the customers who participated in the research. In addition to the purchases, all participants filled in a questionnaire in summer 2018. Food purchase data from the year 2018 were selected for this study. Of the demographic variables, sex and age were obtained from the customer-owner register; education level, household size, current main activity, and self-rated health were added from the survey. Self-rated health was rated on a scale of good, quite good, moderate, quite bad, and bad.

The study is performed with Random Forest, an ensemble classification technique used in combination with the decision tree machine learning model. A Random Forest consists of many decision trees, and the algorithm repeatedly splits the data into the two groups that divide the dataset into the different self-rated health classes most effectively.

Each tree votes for the most popular class, and the Random Forest combines the results of all the decision trees, forming a prediction of the most suitable class by majority voting. The method was chosen because it is suitable for large datasets, has high classification accuracy, and is efficient and easy to use. Random Forest is typically used at the beginning of a study to get an overall picture of the most important and best-splitting variables in the data.

The aim of the study was to find out whether a household's grocery purchases determine self-rated health; where they do, the relevant variables are identified and their effects examined. The ability of the Random Forest to detect and characterize the association between food purchases and self-rated health is examined. The study focuses on one- and two-adult households and also examines differences in purchase habits between two age groups. Customers with a loyalty degree of 60% or more are included in the study.

This thesis opens a new perspective for studying the effects of people's diets on self-rated health at the population level. The method uses the variables that divide self-rated health into its classes most effectively. The food variables with the greatest influence on the classification may be connected to a customer's self-rated health and show their effect across the self-rated health classes. The study material opens a new way of studying the population's purchases objectively and yet in great detail. These purchases are likely to reflect the population's dietary habits and their connection to self-rated health.

The study found that grocery purchases appear to indicate the self-rated health class clearly. The method finds many variables that split the data by their lower or higher purchase amounts. The Random Forest algorithm performs with quite high accuracy on the modified dataset, and its predictions fell mostly into the correct self-rated health class or the classes adjacent to it. The method therefore seems to detect the relationship between self-rated health and the household's food purchases. The age groups showed similarities but also differences in the relationship between self-rated health and food purchases.

As a new methodological approach, Random Forest appears to find many different types of association: descending and ascending linearity, non-linearity, and step-function-like thresholds. The method seems to work well on the dataset used, revealing findings that are harder or impossible to model with other methods. It opens new ways to see how a household's food purchases can affect health.

This study provided information that can be used to offer healthier products and knowledge to customers and thus improve the health of the whole population. Food retailers can, for example, offer customers better alternatives in decision-making situations where unhealthy choices can be replaced with better ones. The research produced new findings, and a scientific article will be written on the subject.

Keywords: Random Forest, Classification, Health, Machine Learning, Decision Tree, Grocery Purchases, Self-rated Health, Supervised Learning

The originality of this thesis has been checked with the Turnitin service.


PREFACE

I started working on this thesis at the end of January 2021. I learned many new things during this process and was able to combine my knowledge of engineering with human health, which has been my interest for a very long time. I understood that my career path is meant to combine these fields. In the future I want to work on the prevention of diseases at the population level.

I'm grateful to my supervisors Jaakko Nevalainen and Jari Viik for helping me during this process and for giving me the opportunity to do this thesis. I could not have hoped for better supervisors – thank you for your support and guidance. I'm also thankful for the support of my family and friends, who were able to take my thoughts elsewhere in my free time. The biggest thanks belong to my partner, who supported me daily during this process.

This education has taught me good information-gathering skills, critical thinking, and the fact that I can learn whatever I want to. Ending my school path was not the way I had imagined because of the pandemic, but now it's time to close this chapter of my life.

I’m looking forward to the future!

Tampere, 30 April 2021

Beata Rantaeskola


CONTENTS

1. INTRODUCTION ... 1

2. THEORY ... 3

2.1. Self-Rated Health ... 3

2.2. Grocery Purchase Data ... 7

2.3. Machine learning ... 9

2.4. Supervised learning ... 15

2.5. Ensemble techniques ... 19

2.6. Random Forest ... 20

3. MATERIALS AND METHODS ... 25

3.1. LoCard dataset ... 25

3.2. Justification for choosing Random Forest as the analysis method ... 29

3.3. Analysis software and visualization ... 30

3.4. Examination between age groups ... 32

4. RESULTS ... 34

4.1. Random Forest model performance ... 34

4.2. Modification of the Random Forest model ... 36

4.3. Overall performance of Random Forest in the Main Study ... 37

4.4. Most important predictors ... 38

4.5. Role of background variables ... 41

4.6. Summary statistics of food purchases ... 45

4.7. Subgroup analyses in age groups ... 47

4.8. Most important predictors among age groups ... 49

5. DISCUSSION ... 60

5.1. Grocery Purchases and SRH ... 60

5.2. Model accuracy and performance ... 60

5.3. Background variables ... 61

5.4. Food Variables ... 62

5.5. Strengths and weaknesses of the material ... 65

6. CONCLUSIONS ... 67

6.1. Main findings ... 67

6.2. Further research ... 69

REFERENCES... 70

APPENDIX A: DECILE VALUES OF OBSERVED FOOD DETERMINANTS ... 76

APPENDIX B: HEATMAPS NOT ADDRESSED IN THE STUDY ... 77

APPENDIX C: OTHER INTERESTING HEATMAPS NOT EXAMINED FURTHER ... 79


LIST OF FIGURES

Figure 1. Evaluating self-rated health is studied to consist of many phases. (Jylhä 2009) ... 5

Figure 2. SRH reporting behavior varies with life outlook across age and gender. (Modified from: Layes et al. 2011) ... 7

Figure 3. Data can be divided into categorical and numerical data, which affects the choice of machine learning model. (Subramanian et al. 2018) ... 11

Figure 4. In the machine learning process, raw data is pre-processed, split, and then used in training and evaluation; the trained model is used to create predictions on real data. (Jeyaraman et al. 2019) ... 13

Figure 5. Machine learning has three categories – supervised, unsupervised, and reinforcement learning – which all learn differently. (Subramanian et al. 2018) ... 14

Figure 6. A supervised learning method learns from labeled data; learning accuracy is tested with test data, which gives class predictions as output. (Subramanian et al. 2018) ... 16

Figure 7. The Random Forest model uses majority voting. (Subramanian et al. 2018) ... 23

Figure 8. Distribution of the self-rated health classes in the studied LoCard dataset. ... 28

Figure 9. SRH distribution in two age categories. ... 33

Figure 10. Distribution of SRH classes in the modified dataset used for the main study; the two lowest health groups are combined into one class, bad health, with 609 observations. ... 37

Figure 11. The 15 most important variables in the main study classified with Random Forest. ... 39

Figure 12. Two heatmaps made with the same variables but differing in SRH: on the left, the average SRH of the raw data classes; on the right, SRH predictions made with RF. The prediction-based heatmap shows clearer results that are easier to interpret. ... 42

Figure 13. Main activity ~ Sausages heatmap; a blank square lacks observations. Higher sausage consumption seems to lower self-rated health in almost all main activity statuses, and main activity strongly influences SRH even before sausages are purchased. ... 44

Figure 14. Fresh fruit ~ Fresh vegetables heatmap, showing that increased vegetable purchases increased SRH the most. Fresh fruits associate with SRH irregularly, seen as a C-curve in the heatmap; combined with high vegetable values, medium fruit purchases increase SRH the most, but alone fruits do not seem to associate with SRH. ... 45

Figure 15. High-fiber cereal ~ Sweet pastries heatmap, showing a linear pattern with SRH: the more high-fiber cereals and the fewer sweet pastries are bought, the better the SRH. ... 47

Figure 16. The 15 most important variables among under-45-year-olds. Main activity played the key role in the tree construction process; chocolates, fresh fruits, sugar-sweetened soft drinks, and fresh vegetables were the most important food variables. ... 50

Figure 17. The most important variables in Random Forest among over-45-year-olds; the most important variables were almost the same as in the main analysis. ... 51

Figure 18. Sausages ~ Fresh vegetables heatmap for households under 45; vegetables seem to increase SRH and sausages vice versa. ... 52

Figure 19. Sausages ~ Fresh vegetables heatmap for households over 45; the effect of these variables remains the same as in younger households. ... 53

Figure 20. Cheeses ~ Fresh fruit heatmap for under-45-year-olds; purchases have a vertical effect on SRH, and consumption of fresh fruits increases SRH greatly in the younger age group. ... 55

Figure 21. Cheeses ~ Fresh fruit heatmap for over-45-year-olds; purchases have a linear effect on health, and higher consumption of both fresh fruits and cheeses increases SRH greatly. ... 56

Figure 22. Chocolates ~ Sweets heatmap for under-45-year-olds, showing a somewhat more complex pattern; in general, higher purchases of both variables are associated with lower SRH. ... 57

Figure 23. Chocolates ~ Sweets heatmap for over-45-year-olds, which is more complex; a higher purchase amount of chocolates with quite low sweet purchases seems to increase SRH. ... 58


LIST OF NOTATIONS AND ABBREVIATIONS

Bias Systematic mistake in the data

Big Data Large amount of constantly growing unorganized data

CT Classification Tree

Decile Division of data into (ten) equal parts

DT Decision Tree

GP Grocery Purchase

GPD Grocery Purchase Data

GPQI Grocery Purchase Quality Index

IDS Imbalanced Data Set

MDA Mean Decrease Accuracy

MDG Mean Decrease Gini

MER Modified Error Rate

k-NN K-Nearest Neighbours

OOB error rate Out-of-bag error rate

RF Random Forest

SRH Self-Rated Health

SVM Support Vector Machine


1. INTRODUCTION

Health is important to every person on this planet, and science is trying to find out which things affect it positively and negatively. Everything influences humans – the environment we live in, the air we breathe, the people around us, the genes we are born with, the food we eat, and the way we sleep. That makes human health a complex subject to study. Health can be studied in multiple ways at the individual or population level, depending on the object under consideration. Population health science studies health promotion, lifestyle diseases, and lifestyle behavior at the population level, whereas, for example, medicine and molecular biology focus on studying individuals and generalizing that information to the population level. The individual level gives more detailed information, whereas the population level provides general information about a larger number of people but little information on individuals. The ideal would be to combine both – specific individual-level information together with a big sample size.

Traditionally, studies of people's eating habits and their association with health have been performed with tools such as diaries, interviews, and questionnaires that evaluate the individual's self-reported eating within a short measurement period. These methods are usually disturbed by conscious and unconscious biases and represent only the individual level well. The same problem occurs with new innovative dietary assessment technologies. (Illner et al. 2012)

Grocery purchase (GP) data could overcome this problem with its long-term approach to studying dietary behavior. It gives a wide perspective on purchase behavior at the individual household level and opens entirely new opportunities to study human health at the population level. Groceries are bought by a household and presumably also eaten by its members. The benefit of GP data is its objectivity: the data include all grocery purchases recorded with an electronic loyalty card. GP data make it possible to track the eating behavior of an individual household, when purchases are assumed to equal consumption. GP data reveal typical purchase behavior, for example across age groups and education levels, but also show immediately how policy and food prices affect it.

This thesis provides a new methodical approach to studying the patterns in customers' grocery purchases that may explain their self-rated health. The study is performed with grocery data and a survey in which background information and self-rated health were asked. One answerer in every household reported their health on a scale from good to bad. The data analysis is done with the Random Forest ensemble method, in which the most suitable food and background determinants are identified without hypotheses made in advance. A machine learning approach to grocery purchase analysis is a novel way of analyzing dietary patterns in Finland. This type of data is valuable and new, bringing a lot of novelty value to this field of research. The thesis has a multidisciplinary nature, combining biostatistics and technology. A scientific publication will also be made on the subject.
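The workflow described above can be sketched in a few lines. The following is a minimal, hypothetical illustration with synthetic data: the LoCard data and its exact variables are not public, so every variable name and value here is a stand-in, not the thesis's actual analysis code.

```python
# Hypothetical sketch of the analysis idea: classify an ordinal SRH label
# from food-purchase variables and rank variables by importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Three columns stand in for yearly purchase shares of three food groups.
X = rng.random((n, 3))
# Toy rule (an assumption, not a LoCard finding): more of group 0 and less
# of group 1 -> better SRH; group 2 is pure noise.
score = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(n)
y = np.digitize(score, [-0.5, 0.0, 0.5])  # four ordered SRH classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Each tree votes; the forest predicts the majority class.
print(forest.score(X_test, y_test))
# Variable importances give the "most effective splitting variables".
print(forest.feature_importances_)
```

On data with such a planted signal, the forest recovers it and ranks the informative columns above the noise column, which is exactly the hypothesis-free screening role the thesis assigns to RF.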

This study’s main research questions are:

- Do grocery purchases determine self-rated health (SRH) and, if so, which foods among the grocery purchases are the determinants?

- Is the Random Forest method able to detect and characterize the association between grocery purchases and self-rated health in a meaningful way?

- Are there similarities or differences between age groups in the food determinants of self-rated health?

The first objective is to define the relationship between GP and SRH. If associations between SRH and GP appear, it can be assumed that these variables might indicate SRH. The ability of Random Forest (RF) as a new approach to studying GP is analyzed. Age groups are examined to see how food variables influence SRH in the older and younger population.

This thesis is divided into theory, materials and methods, results, discussion, and conclusions. Chapter 2 reviews the theory, which combines many research fields, including self-rated health, grocery data, and the machine learning process with its data analysis tools. Classification methods and ensemble techniques are reviewed in more detail to give background on the method used in this study. Chapter 3 reviews the research material and methods. In Chapter 4 the results are presented and evaluated. Chapter 5, the discussion, analyzes the results, RF performance and capabilities, and the strengths and weaknesses of the study material; the results are compared with previous nutritional studies. Chapter 6, the conclusions, summarizes the results and gives an overview and future implementations of this study.

Appendix A lists the decile values of the observed food variables. Appendix B reviews heatmaps that were not handled in detail, as the study focuses on detailed analysis of specific variables rather than a quick glimpse of all of them. Appendix C includes heatmaps that were also not included in this study; they had interesting results and could be studied in more detail in the future.


2. THEORY

This chapter analyzes self-rated health and its accuracy as a health evaluation tool. Grocery purchase data is reviewed, and its possibilities and weaknesses are studied. Datasets, the machine learning process, and its most relevant statistical analysis methods are introduced, with supervised learning handled in more detail. After that the meaning of an ensemble method, especially Random Forest, is clarified: why such methods are used, how they work in practice, and why they are popular in combination with machine learning models.

2.1. Self-Rated Health

Health is a complex matter and consists of many different areas that affect it. It is constantly changing and is experienced differently by different people. The WHO defined overall human health in 1948 as "a complete physical, mental and social well-being and not just merely the absence of disease or infirmity" (WHO 2021). This definition has been criticized as impossible to achieve ().

Sickness is typically described as an exception to the normal. The term normal, on the other hand, is hard to distinguish from the abnormal because humans are not the same. Medically speaking, health can be measured with many tests, e.g. blood tests for hormone and vitamin levels and sports performance, and with many different meters such as waistline, heart rate variability, and ECG. The resulting values are hard to generalize into the health states that are kept as normal and referred to as healthy. (Duodecim 2020) Therefore, guidelines have been created to illustrate what kind of range can be described as normal, for example the reference values of blood measurement variables.

Self-rated health (SRH) is defined as people's subjective opinion of their general health state (Tilastokeskus 1995). It is a very informative indicator of overall human well-being. The SRH survey technique is widely used in health research for its ease of use, low cost, and power as a health measure. Measuring SRH requires little of the respondent, yet it yields valid and reliable results. (Bombak 2013) It has potential applications in monitoring a population's health state (Ware et al. 1993).

SRH is considered the world's easiest meter for analyzing human health and the most used health indicator in all kinds of health-related studies. It has many names – self-rated health, self-assessed health, and self-perceived health – which all mean the same thing; the term self-reported health is typically not used in this context. Interest in SRH increased when it was understood that it indicates many things well, such as people's mortality, future performance, and health care need. (Jylhä 2021)

SRH is a subjective reporting method, and it might be misleading because these health questions are very personal and wide-ranging. Despite the subjective nature of the question, SRH has been found to be a good predictor of the elderly's multimorbidity (Mossey & Shapiro 1982) and also of future health care use (Palladino et al. 2016). SRH's ability to predict mortality has been found to decline with age (Layes et al. 2011). A meta-study found medium correlations between subjective well-being and health status surveys, with the association stronger in more developing countries (Ngamaba et al. 2017).

The newest SRH study shows consistent relations between SRH and various biomarker levels measured in blood and urine. The biomarkers used indicate the function of the organs and the processes of the human body. The study showed that people who rated their health lower also showed poorer biomarker levels, whereas the levels were better when good SRH was reported. The results thus show that SRH can be regarded as having a "solid biological basis". (Kananen et al. 2021)

SRH assessment is a process (Figure 1) that is evaluated in a contextual framework. The answerer first considers cultural and historical conceptions of health, which include diagnoses, performance, symptoms, and signs of illness, e.g. medicines or sick leaves. A statistical connection is formed by the things through which the answerer understands or wants to measure his or her own health and which are therefore considered in the assessment. Younger people tend to add genetics, future aspects, and health habits to the contextual framework. After this first part, health is analyzed more generally and typically compared with other people, life expectations, and earlier health experiences. All these things are then taken into consideration and compared to the preset health options, and the most suitable option for SRH is chosen. (Jylhä 2009)

SRH measures non-specific health. It does not tell the answerer what to consider in the rating – rather, it leaves the answerer space to evaluate health through the points that are important to him or her. This is validated by studies in which the independent predictive force of SRH for mortality is lower the more other health measures or tools are considered. SRH can be said to perform more holistically: it covers the variance of the body's state – not the existence of diseases but their severity. (Jylhä 2021)

Figure 1. Evaluating self-rated health is studied to consist of many phases. (Jylhä 2009)

SRH is regarded as an age-calibrated meter: older people do not find difficulties as problematic and accept their health issues more readily than the younger population. The elderly tend to report better SRH than younger people who have the same problem, because it is a general expectation to have more health problems at an older age. On the other hand, people are cautious about rating their health lower, since transient health problems are not considered to lower self-rated health. SRH is also a context-specific meter, and ratings depend on the culture: what is regarded as normal varies between cultures. As a result, SRH cannot be compared between (totally) different cultures. (Jylhä 2021)

Surveys assessing health have been used for a long time; their use has been traced back over 300 years. Back then, health surveys did not have a scale construction. To improve their functionality, Ware et al. (1993) standardized the Short Form-36 (SF-36) Health Status survey, a conceptual instrument for measuring health. The SF-36 also has an even shorter abbreviation called the SF-6D, which is widely used in health economics. Both instruments include the self-rating of health. (Ware et al. 1993)

The simple question How do you see your health state? has played a key role as a health indicator for over 70 years (Jylhä 2009). When health status needs to be asked with only one measure, Fayers & Hays (2005) suggested the rating "excellent, very good, good, poor or fair". Ratings follow the same idea across researchers: the standard of Ware et al. (1993) uses quite similar word choices – a scale from excellent, very good, good, and fair to poor – whereas the WHO (1996) recommended the scale "very good, good, fair, bad and very bad". In many studies SRH surveys have performed as well as other health status measures, and they can be carried out at lower cost than assessments made by an expert. (Mossey & Shapiro 1982)
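As a small illustration of how such a verbal scale is typically handled in analysis code, an ordered categorical keeps the ranking of the answers so that adjacent classes stay adjacent. This is a generic sketch, not the thesis's code; the class wording follows the WHO-style scale quoted above.

```python
# Sketch: encode a five-point SRH scale as an ordered categorical.
import pandas as pd

# WHO-style five-point scale, ordered from worst to best.
levels = ["very bad", "bad", "fair", "good", "very good"]
answers = pd.Series(["good", "fair", "very good", "bad", "good"])

srh = pd.Categorical(answers, categories=levels, ordered=True)
# Integer codes preserve adjacency: "good" and "very good" differ by 1,
# which is what makes "correct or adjacent class" a meaningful notion.
print(srh.codes.tolist())
```

The integer codes can then serve directly as an ordinal target for a classifier.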

The factors associated with SRH are more complex. Jylhä (2009) pointed out that "hardly any other measure of health is more widely used and more poorly understood than SRH". Diseases, various symptoms, lifestyle, and physical performance are the major areas affecting a person's assessment of their self-rated health state (THL 2015). Numerous studies have investigated which factors are associated with SRH; those typically examined include demographic factors (age and gender), socioeconomic factors (income, education, and unemployment), health behaviors, and place (Jylhä 2009). A Chinese population study also showed a correlation between higher SRH ratings and higher dietary knowledge, behavior, and attitudes, with the opposite association for lower SRH ratings (Yang et al. 2020). Other studies have likewise shown a correlation between lower SRH and poorer dietary habits (Hsieh et al. 2018, Conry et al. 2011).

People who have healthier lifestyles tend to rate their health more pessimistically, whereas over-80-year-olds and people with less income or lower education tend to report more optimistic health (Figure 2) (Layes et al. 2011). Nowadays a third of the Finnish adult population report their health status as moderate or below, and the older population reports poorer self-rated health than the younger population (THL 2021). This pattern is also seen in the study of Layes et al. in Figure 2, where optimistic and realistic SRH reporting behavior is examined: even though the rating is more optimistic than reality, SRH seems to decline with age.


Figure 2. SRH reporting behavior varies with life outlook across age and gender. (Modified from: Layes et al. 2011)

Self-rated health reporting also depends on gender (seen in Figure 2): women tend to report more optimistically than men. Women usually report from a wider perspective of health than men, and similar differences are seen between people in different life phases.

People can also compare their health state relative to people of the same age, and SRH has often been asked to be evaluated in comparison with the same-age population (THL 2015). The formulation of the question plays an important role in the research, and the way the respondent is guided toward the answer can influence their rating behavior (Tilastokeskus 1995). This makes quantitative measurements harder to perform when the same thing is to be measured. Because SRH surveys can be a powerful research tool, it is important that researchers understand the behavior of the respondent in order to produce valid results (Layes et al. 2011).

2.2. Grocery Purchase Data

Nowadays new data is generated in enormous quantities every minute around the world, for example in traffic, in factories, and on social media. This data is collected, with authorization, by companies, and it can be used as valuable information in decision-making situations and in operational development. Data is referred to as a company asset, and its market share is growing every day. For example, a food retailer can collect data from its customers' grocery purchases (GP): the company can store the bought products, date, total sum, customer id, amount of food, and so on. This data provides valuable information for companies, but also for researchers. Humans also consume data when they want personalized services to fulfill their needs. (Alpaydin 2014)

Companies can use GP data for marketing communication and targeted promotion, for customer profiling and segmentation, and for offering complementary or higher-cost products to customers. GP data, if available for research purposes, can be used in population-level eating behavior research and public health assessment. GP data covers many households while still providing highly detailed information on an individual household's purchase behavior. (Nevalainen et al. 2018)

Food is one part of a healthy lifestyle. Eating has been found to associate strongly with a healthier life and the prevention of diseases. Varied eating and food quality and quantity are the major areas of healthy eating. Higher consumption of nutrient-dense foods, vegetables, berries and fruits, and high-quality oils, together with low consumption of processed foods with empty calories, trans fats and sugars, has been found to be the dietary key to a healthy life (Ilander et al. 2014). The Mediterranean diet, known to include plenty of vegetables, fruits, and olive oil, has been found to associate with good health (Willett et al. 1995). Berries, olive oil, leafy greens, nuts, and fatty fish are preferred for their anti-inflammatory impact (Harvard Health Publishing 2020). Foods have also been studied as a cause of inflammation, and they can be scored by their inflammatory potential using six inflammatory biomarkers (Shivappa et al. 2014).

The Western diet has been found to cause low-grade inflammation. It typically combines a lot of energy and inflammation-increasing foods with foods that lack micronutrients and decrease inflammation (Barbaresko et al. 2013, Cordain et al. 2005). In meta-analyses, this type of pro-inflammatory diet has been associated with an increased risk of cardiovascular disease and related mortality (Shivappa et al. 2018) and with increased mortality in the general adult population (Bonaccio et al. 2016).

The Grocery Purchase Quality Index-2016 (GPQI-2016) is a validated assessment tool for analyzing the quality of household grocery purchases. The index includes 11 food components derived from the US Department of Agriculture's Food Plan models. Food purchases are scored by component, and the points indicate the healthiness of a household's food purchase behavior. The index has potential future applications in describing eating behavior at the population level. (Brewster et al. 2017)

GP data can be a helpful tool for analyzing people's eating behavior at the population level. The data is a by-product of customer shopping and almost "free" for research purposes. On the other hand, grocery purchase data is not freely available for research: its use requires negotiations and data collection consent from the customers. Grocery data has a huge sheer size and enables data collection over a very long time. GP data also enables research on specific details, such as what exactly is bought, how much, and how often. Before the use of GP data, dietary assessment was harder to study, because diet data was collected with food diaries, interviews, and surveys. Such data collection involves a tendency to report healthier habits than exist in practice, and it covers only a short time period. In countries where loyalty rates with specific retailers are high, the opportunity to collect purchase data lowers this bias and also provides low-cost and valuable information on the population diet. What makes this type of data interesting is that the eating patterns of high-loyalty customers are easy to see when data collection happens objectively over a long time. (Nevalainen et al. 2018) Purchase behavior can vary, for example during holidays and special days, but these exceptions do not stand out much in the bigger picture.

Challenges with GP data arise because some specific purchases may be made entirely from other retailers (Nevalainen et al. 2018). For example, vegetables and fresh eggs can be bought at local markets and some special foods in other, smaller stores. Customers can also eat in restaurants and workplaces, which makes this type of data harder to generalize to daily eating habits.

It also needs to be pointed out that purchasing does not equal consumption: a household member does not necessarily eat all the food that is bought, as it can be bought for other family members or for some other use (Nevalainen et al. 2018). With smaller household sizes, this mismatch between purchases and consumption can be assumed to decrease.

Eating behavior has been found to evolve during the first years of a child's life, meaning that dietary patterns are formed very early and are usually learned from the parents, who make the food choices for the family (Birch et al. 2007). Preferred purchase behavior is therefore commonly repeated in daily life once some way of shopping has proven good. Thus, it can be assumed that people typically eat in the same way even if the food is bought from other retailers, and it is therefore realistic to assume that purchases approximate eating and consumption.

2.3. Machine learning

Epidemiology is typically considered part of health studies. In epidemiology, researchers search for the causal links in the environment and lifestyle that affect health and cause sickness; it is health and disease research at the population level (Krickeberg 2012). The key methods of epidemiology rely on statistical inference. Biostatistics uses mathematical calculations to study the biological sciences (Iuliano & Franzese 2018). For example, the health status, mortality, infant mortality, morbidity, and life expectancy of a human population can be studied.


In statistics, commonly used methods of analysis are scales, variability, means, dependency indicators, and, more ambitiously, a range of statistical models for different data types. Statistical reasoning plays a key role in scientific hypothesis testing, and this hypothesis method is the traditional way of doing data analysis (Kestenbaum 2009). Statistical analysis aims to generalize findings from individuals to the general population. The analysis can produce statistical key figures, for example a health indicator, which is a relevant measure of public health; one such key figure is the share of daily smokers in Finland (Koponen et al. 2019). Statistical modeling can help to understand the phenomenon.

Carrying out statistical surveys requires a lot of planning, as the design determines what can be extracted from the data in the end. A sufficiently representative group of samples is needed for statistical analysis; if the selected targets form a representative set, questionnaire results can be generalized to represent the whole population. Data analysis forms higher-level models and information, which enable conclusions to be drawn in different situations. (Subramanian et al. 2019)

A massive amount of data, referred to as big data, offers many possibilities when plenty of data is available. It is important to know which data is relevant to use and collect, so that only the important data is stored and analyzed. Big data also brings problems, and machines are used to help (Alpaydin 2014). The purpose of machine learning is to reveal structures that cannot be seen in uncategorized data. Machine learning algorithms can transform data into a more useful form, and intelligent actions can be performed with it (Brentz 2015). An algorithm can learn from the dataset without a separate definition process made by the researcher.

Nowadays some statistical analysis can be performed with machine learning methods. Machine learning has a lot in common with traditional statistical analysis, because both make predictions from the material or learn from it. Machine learning methods were commonly used even before computers. (Hastie et al. 2009)

Machine learning helps humans in decision-making situations, because human capacity is limited: humans cannot process huge amounts of data effectively and economically. Machines, on the other hand, can keep up with the massive flow and complexity of data (Brentz 2015). It also needs to be noted that machines alone are not perfect; they need to be guided by humans to perform their assigned task successfully. Machine learning methods are practical in situations where human lapses occur (Subramanian et al. 2018). Machine learning makes predictions with a created model, and the model is derived from data (Jeyaraman et al. 2019).


Data analysis needs a dataset for examination. A dataset contains rows, where each row is called a record. A dataset can have multiple attributes, one per column, each of which gives information on a specific characteristic. For example, a dataset of students might have four attributes, namely roll number, name, gender, and age, each of which is understandably a specific characteristic of the student entity. An attribute can also be termed a feature, variable, dimension, or field. (Subramanian et al. 2018)

There are multiple steps to consider with data before machine learning is applied. The main points to outline are the data type and quality. As Subramanian et al. (2018) pointed out, data can be divided into two types, qualitative and quantitative (Figure 3). Qualitative, or categorical, data cannot be measured, and it is further split into nominal and ordinal data. Nominal data has a named value, e.g. gender, with no numerical scale of measurement; the most frequently appearing value can still be identified in the data. Ordinal data can be arranged in some order, like good, moderate, bad, where some values are better than others.

Figure 3. Data can be divided into categorical and numerical data, which affects the choice of machine learning model. (Subramanian et al. 2018)

Quantitative data can be measured, and it is further split into interval and ratio data. Interval data can be mathematically divided and can take mean values, like the difference between temperatures, but it has no true zero (temperature always exists). Ratio data, on the other hand, has a zero value; examples are age and salary. (Subramanian et al. 2018) By understanding the different data types, the right approach to the dataset can be chosen.

Data exploration and preparation improve the quality of the data (Lantz 2015). By understanding the data, potential issues can be outlined and misleading results avoided. Outliers in the dataset should be deleted, as they might corrupt the whole analysis; removing them is not necessary if their number is small. A dataset can also have missing values. They need to be removed if the data remaining after removal is still sizeable. Missing values can also be replaced with an imputation method, where the missing value is replaced with the mean, mode, or median of the same attribute. Data analysis must be done carefully, as data privacy is a very important factor to consider; it has become prominent in machine learning, where confidential data may also be handled. (Subramanian et al. 2018)
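
The imputation idea described above can be sketched in a few lines of Python. This is a minimal illustration (the function name and the example data are invented for this sketch, not taken from the thesis): missing entries are filled with the mean, median, or mode of the observed values of the same attribute.

```python
import statistics

def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

# Two missing age values are filled with the mean of the observed ages
ages = [25, 30, None, 35, None, 30]
print(impute_missing(ages, "mean"))  # [25, 30, 30, 35, 30, 30]
```

In practice the choice between mean, median, and mode depends on the data type: mode suits categorical attributes, while mean or median suits numeric ones.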

Machines learn by a process which is now described in general terms (Figure 4). The process mimics the way humans learn and does so by statistical and mathematical rules. In theory the process can be simplified into four steps, but in reality it is much more complex and varies between applications. The first part of the process is data storage: everything starts with data. Raw data is only zeros and ones in stored memory before it is given meaning in later steps, and it provides the base facts for abstraction. The next step, the abstraction process, gives this raw data more meaning: the data can, for example, be grouped, graphed, or expressed mathematically, molding it into a more structured form (Subramanian et al. 2018). The process results in new knowledge, which shows the relations between the studied data elements. Typically the data is divided into training and test data, and the training data is first applied to the abstraction process, where it is formed into concepts.

The next step, generalization, forms the most suitable inference for the created model. In this step the abstracted data models are trained, and the best one is chosen and utilized in the following steps; the step prunes the most irrelevant findings and keeps the important ones. The last step is evaluation, where the success of the learning algorithm is measured by heuristic search. The performance of the algorithm is evaluated with test set data that the algorithm has not seen before. After this, the created model can be used on the research data, and it forms predictions that can be used for study purposes. (Subramanian et al. 2018)
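
The division into training and test data mentioned above can be sketched as follows. This is an illustrative Python sketch (the function name, the fixed seed, and the 25 % test share are arbitrary choices, not from the thesis): the records are shuffled, and a fraction is held out as unseen test data for evaluation.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle the records, then hold out a fraction as unseen test data."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 75 25
```

The test set must stay untouched during training, so that the evaluation step measures performance on data the model has genuinely not seen.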


Figure 4. In the machine learning process, raw data is pre-processed, split, and then used in the training and evaluation process. Real data is used to create predictions with the created model. (Jeyaraman et al. 2019)

The quality of the data has an enormous effect on the success of a machine learning method. To perform well, the data needs to be well prepared and cleaned (Jeyaraman et al. 2019). If the quality of the used data is low, the model can be affected, or even invalidated, by bias, which is systematic error. Bias can occur with every machine learning algorithm, since all of them have their own weaknesses and understandably do not perform perfectly (Lantz 2015). Appropriate compilation of the material and good data quality allow the interpretations of the phenomenon to be generalized.

The created model can be overfitted or underfitted to the used data, which causes unwanted model behavior. An overfitted model has learned overly specific and irrelevant behavior, which decreases its accuracy. An underfitted model misses important patterns in the learning phase and is too simple. In either case, the model easily creates wrong predictions on unseen data (Subramanian et al. 2018). This problem also depends on the chosen model. Balance in the fitting process has a vital role in determining the performance of the model, so it is important to find the balance at which the necessary patterns are recognized.

The machine learning process can be performed with three different tasks (Figure 5): supervised learning, unsupervised learning, and reinforcement learning. In supervised learning the task of the machine is to create a predictive model that describes past information and predicts a label for each unknown test observation. A supervised model uses labeled data and can handle two kinds of prediction types: when predictions are made on a known class, the supervised model uses classification, and regression is used when the predicted variables take values on a continuous scale (Subramanian et al. 2018). Supervised learning is covered in more detail in Section 2.4.

Unsupervised learning differs from supervised learning in that it organizes similar objects together by itself. Unsupervised learning discovers groupings and makes associations between data elements, trying to find patterns in unlabeled data. The unsupervised method is usually referred to as a descriptive model, and it can be performed with clustering and association analysis methods.

A typical example of this model type is customer segmentation. Clustering organizes unlabeled data with similarity measures so that similar items belong to the same cluster. The most used similarity measure is distance: data items far apart from each other are generally found to belong to different clusters. Association analysis finds association rules in the data and often focuses on finding relationships in behavior. An example of this type of method is grocery basket analysis, where grocery transactions are analyzed: the learning tries to find patterns that occur when some item is bought, such as the purchase of one item typically leading to the purchase of another. For example, butter can be found to associate with bread purchases. Unsupervised learning is more difficult to put into action than a supervised model. (Subramanian et al. 2018)
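
The butter-and-bread example can be made concrete with the two standard association-rule measures, support and confidence. This is an illustrative Python sketch with invented basket data (the function name is hypothetical, not the thesis's method): support is the share of all baskets containing both items, and confidence is the share of antecedent baskets that also contain the consequent.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the association rule antecedent -> consequent."""
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

baskets = [
    {"butter", "bread", "milk"},
    {"butter", "bread"},
    {"bread", "eggs"},
    {"butter", "jam"},
]
# butter appears in 3 baskets, 2 of which also contain bread
support, confidence = rule_stats(baskets, "butter", "bread")
print(support, round(confidence, 2))  # 0.5 0.67
```

Real basket-analysis algorithms such as Apriori search all item combinations for rules whose support and confidence exceed chosen thresholds; this sketch evaluates just one rule.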

Figure 5. Machine learning has three categories, supervised, unsupervised, and reinforcement learning, which all learn differently. (Subramanian et al. 2018)


Reinforcement learning happens automatically with raw data that contains no labels or classes. The process functions through a reward system when subtasks are successfully completed, and the model is also punished for its mistakes (Subramanian et al. 2018). The model tries to maximize its future rewards by updating the expected outcome with a reward prediction error (Lee et al. 2019). Reinforcement learning is used when there is no idea of which class a variable belongs to. The method is the most complex and the hardest to bring into use. (Subramanian et al. 2018)

2.4. Supervised learning

Since this study uses labeled data with known classes, the choice of supervised learning is natural; this learning method is therefore covered in more depth next. A supervised learning model can be used for classification and regression. With the classification method the machine predicts a categorical class from features that can be either categorical or continuous. Supervised methods work differently, and they all have their strengths and weaknesses. Widely used classification methods are the decision tree, naïve Bayes, the Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN). When the estimated variable has a numeric value, the method is called regression, and the most used regression model is linear regression. (Subramanian et al. 2018)

A supervised learning model uses past information to classify the data, then learns and gains experience from it. Humans are not mainly involved in the process, but the name "supervised" comes from the fact that the method gives the learner an opportunity to supervise how well the created model is trained and how well it achieves the target values. (Lantz 2015)

All data used in supervised learning is labeled and has a known class. The quality of the labeled training data has a tremendous effect on the reliability of the predictions made with a supervised learning method: with poor training data, the predictions are not correct. Supervised learning is used when there is knowledge of how to classify the data (Subramanian et al. 2018). Supervised learning has been successfully applied e.g. to handwriting recognition, disease prediction, and stock markets (Lantz 2015).

To learn, a supervised method uses labeled training data as input (see Figure 6). The supervised model forms a suitable algorithm to perform its given task, a process called fitting. It searches for similarities and tries to reveal patterns in the data that combine the members of a specific class. The training data is labeled with all the different features the model is wanted to learn. After the prediction model has been created by the algorithm, test data is used to see how accurately the created model performs.


Figure 6. The supervised learning method uses labeled data and learns from it. Learning accuracy is tested with test data, which gives the predicted classes as output. (Subramanian et al. 2018)

The accuracy of the model's predictions is evaluated with unseen test data. Because the classes of the test observations are known beforehand, the predictions created by the model can be compared to the actual, real classes. The accuracy of a classification model is shown as the percentage error rate of the predictions; this rate counts only the predictions that were correctly classified into the right class. Predictions on new data are made once the model reaches the required accuracy. With continuous variables the accuracy can be evaluated with residual values, which are the distances between the actual and predicted values. (Subramanian et al. 2018)
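
The accuracy comparison described above reduces to counting how many predicted labels match the known test labels. A minimal Python sketch (the labels are invented examples in the spirit of the thesis's SRH classes):

```python
def accuracy(actual, predicted):
    """Share of predictions that landed in the correct class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

y_true = ["good", "good", "bad", "moderate", "bad"]
y_pred = ["good", "bad", "bad", "moderate", "bad"]
print(accuracy(y_true, y_pred))  # 0.8, i.e. an error rate of 20 %
```

For continuous targets the analogous quantity would aggregate the residuals, e.g. as a mean squared error, instead of counting exact matches.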

An Imbalanced Data Set (IDS) can cause problems in the supervised learning process when the studied classes have different numbers of observations, which causes many difficulties in classification. Typically, algorithms are driven by accuracy: when a minority class is noticeably smaller than the majority class, the minority class hardly influences the overall accuracy of the model at all. The classification of this class can therefore be very poor without this being visible in the overall prediction accuracy. The algorithm also does not consider the size of the error of falsely classified predictions. The IDS issue is therefore handled to achieve better classification results among all studied classes. IDS can be addressed by oversampling the minority classes or by undersampling the majority class in the learning phase. A sampling method called SMOTE can be used to create new artificial observations for the minority class, which reduces the imbalance. (Visa & Ralescu 2014)
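
The oversampling idea can be sketched with plain random duplication; note that this is the simplest variant, whereas SMOTE instead synthesizes new points between minority-class neighbours. The function below is an illustrative sketch (its name and data are invented, not from the thesis):

```python
import random
from collections import Counter

def oversample_minority(records, labels, seed=0):
    """Duplicate randomly chosen minority-class records until every class
    has as many observations as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    target = max(len(recs) for recs in by_class.values())
    out_records, out_labels = [], []
    for lab, recs in by_class.items():
        balanced = recs + [rng.choice(recs) for _ in range(target - len(recs))]
        out_records.extend(balanced)
        out_labels.extend([lab] * len(balanced))
    return out_records, out_labels

recs, labs = oversample_minority([1, 2, 3, 4, 5, 6],
                                 ["good", "good", "good", "good", "bad", "bad"])
print(Counter(labs))  # both classes now have 4 observations
```

Duplication balances the class counts but adds no new information; that limitation is exactly what motivates synthetic approaches such as SMOTE.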


Regression tries to find the relationship between a dependent variable and independent variables: the dependent variable is modeled as a function of the independent variables. The simplest and most widely used regression method is the linear regression model, where the relationship of the variables is assumed to follow a straight line, y = a + bx + error. The machine's job is to identify a and b, where y is the dependent variable to be predicted and x is the independent variable, called the predictor; their relationship can be positive or negative. The straight line is fitted to the observation points by the least squares method, which minimizes the errors between the observations and the created line, i.e. minimizes the sum of (y - a - bx)^2 over all data points. The regression method can be used in true/false-type hypothesis testing, where the researcher defines a hypothesis, e.g. that tobacco affects health negatively, and the algorithm is then used to test the hypothesis (Lantz 2015, pp. 171-218). The method also extends to non-linearities by fitting e.g. y = a + bx + cx^2 + error (NCSS 2021).
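
For the straight line, the least squares minimization above has a closed-form solution: b is the covariance of x and y divided by the variance of x, and a is the mean of y minus b times the mean of x. A minimal Python sketch of this (illustrative, not the thesis's code):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Points generated from y = 1 + 2x are recovered exactly
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

With noisy observations the recovered a and b would only approximate the true values, and the residuals (y - a - bx) would quantify the remaining error.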

There are multiple classification machine learning methods to choose from. k-Nearest Neighbors (k-NN) is a data classifier also called a lazy learner. The method does not actually learn anything, as it skips the abstraction and generalization steps of the machine learning process explained above. The method classifies data by placing it in the same category as its nearest neighbors. k-NN uses a distance function, which matches the data with its neighbors and chooses the most suitable label. The method is good for classification when the classes are easy to separate and differ from one another. K represents the chosen number of neighbors, and it mostly affects data overfitting and underfitting: the choice of the k value is very important, as a larger k can cause small but important patterns to be ignored. The k-NN method is very simple and widely used, mostly because of its efficiency. On the other hand, it provides hardly any understanding of how features relate to a class, since the method produces no model, and it becomes slower as the data volume increases. (Lantz 2015, pp. 65-87)
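
A minimal k-NN classifier can be written in a few lines; the sketch below uses invented 2-D points and Euclidean distance (real implementations add refinements such as distance weighting):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify the query point by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))  # A: all three nearest points are class A
```

Notice that no model is built: every prediction scans the whole training set, which is exactly why the method slows down as the data volume grows.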

Bayesian methods estimate the probability of an event from trials in which the event can occur. Bayes' theorem is used to construct this relationship: it estimates the probability of an event given another event and its evidence, showing the relation between these dependent events. The naïve Bayes method learns from probabilities, and it is the most used Bayesian method. The basic idea is to predict the outcome of a hypothesis from its evidence. The naivety comes from the assumption that all the dataset features are equally important; in real life it is the opposite, and some features are more important than others. Naïve Bayes is typically used for text data, for example for weather estimates and text classification such as spam email filtering. With its simple and effective method, naïve Bayes functions well with noisy data. It can also be used for numeric data by binning the data into categories, but it is not an ideal method for datasets containing many numeric values. A weakness of the Bayes model is also that the estimated probabilities are not as trustworthy as the predicted classes. (Lantz 2015, pp. 89-124)

The decision tree (DT), also called a binary tree, can be used for classification and regression. It is a predictive, tree-like machine learning model that also belongs to the group of supervised learning methods. The model's structure mimics a tree with leaves and branches. A DT begins with a root node, and at each node it classifies and divides a subset into two smaller subsets (branches), called decision nodes; this process is repeated several times. (Breiman et al. 1984)

When the classes are known beforehand, the DT classifies the subsets into the predefined classes by choosing the attribute that separates the classes most effectively. This dividing is done by an algorithm which chooses the best candidate attributes (Lantz 2015, pp. 126-127). Feature subsets and rules can vary during the classification phase if needed (Du & Sun 2008). The splitting process continues and new descendants form. Descendants that no longer split are called terminal subsets; these demonstrate the different classes, which in a supervised model are known beforehand. Terminal subsets can differ from each other or represent the same class label, so two or more subsets representing the same class can occur, depending on the number of studied classes (Breiman et al. 1984, pp. 20-21). The tree model is created with training data, and the accuracy of the model is evaluated with test data.

There are multiple splitting criteria for categorical variables that can be used with a decision tree classifier, for example the Gini index, the chi-squared statistic, and information gain (Lantz 2015, p. 135). An empirical comparison of these criteria was performed by Mingers (1989). They all act differently and can improve the performance on the data if the right one is used for the dataset.

Information gain uses the change in entropy for each split and selects the split with the lowest entropy, i.e. the highest information gain; the split process is finished when homogeneous nodes are achieved. The chi-squared statistic is a traditional way to measure connections between chosen variables in a contingency table. The Gini index is defined as "the mean difference from all observed quantities". The Gini index also drives toward node homogeneity like information gain, but it does not contain logarithms, which are computationally more demanding. (Ceriani & Verme 2011)
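
The two homogeneity measures above follow directly from their definitions: Gini impurity is one minus the sum of squared class proportions, and entropy is the Shannon entropy of the class proportions. A small illustrative Python sketch (the labels are invented examples):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_i^2); 0 for a completely pure node."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; information gain compares the parent's
    entropy with the weighted entropy of the child nodes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(gini(["good"] * 4))                       # 0.0, a pure node
print(gini(["good", "good", "bad", "bad"]))     # 0.5, worst case for two classes
print(entropy(["good", "good", "bad", "bad"]))  # 1.0 bit for a 50/50 split
```

A split candidate is then scored by how much it lowers these values in the child nodes relative to the parent.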


The DT method is also called divide and conquer, as it is a powerful classifier (Lantz 2015). A classification tree is used to produce an accurate classifier or to uncover the predictive structure of the problem (Breiman et al. 1984). Decision trees show the connections between features and potential outcomes. The main issues with a classification tree are knowing when to stop splitting and focusing on good attributes (Breiman et al. 1984, p. 23). In addition, small variations in the data may result in a totally different-looking tree; this follows from the learning phase, where different training data is used to create the model. Thus, the outcomes of an individual tree can vary largely when only one tree makes the prediction, which makes single trees somewhat more complex to interpret and possibly lower in accuracy. (Yiu 2019)

The Support Vector Machine (SVM) is a black box method which can be used to define the relationship between features and outcomes. SVM tries to form a hyperplane that creates a homogeneous partition. The maximum margin hyperplane is used to create the most suitable separation, and the support vectors are the points nearest to this boundary. This process requires and relies on very complex mathematics. A linear approach is achieved by using higher-dimensional spaces with newly created features, and the kernel trick is used to find nonlinear relationships between variables in real-life applications. The SVM method combines ideas from the k-NN method and linear regression, and this combination forms a powerful tool for viewing intricate relationships. The mathematics of SVM is complicated, and the most suitable model is found by testing different parameter and kernel combinations. Because the training phase lasts longer, SVM is slower to use, and its results can also be harder to interpret. On the other hand, SVM is hard to overfit and is not much influenced by noisy data, and the model usually performs with high accuracy. (Lantz 2015)

2.5. Ensemble techniques

There are multiple ensemble techniques that can be used in combination with a supervised learning model. Ensemble algorithms cannot be used alone; they are combined with a chosen model. These techniques improve the effectiveness and accuracy of the methods used. Ensemble techniques have been successfully used e.g. to improve feature selection, error correction, and confidence assessment. They bring different opinions into the model to help decision making, similarly to what everyone does daily in their lives. The algorithm uses different kinds of majority voting, which functions either sequentially or in parallel. The most popular ensemble techniques are bootstrap aggregation (bagging), boosting, random forest, and stacking (Zhang & Ma 2012, pp. 1-17). An ensemble method is a combination of various weaker models, which together form the overall prediction model; with the combination of weaker models, better results are achieved.
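
The majority voting mentioned above can be sketched as a simple hard-voting combiner (an illustrative sketch; the model outputs below are invented): for each observation, the label predicted by the most models wins.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models: for each observation,
    the label predicted by most models wins."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]

model_1 = ["good", "bad", "good"]
model_2 = ["good", "good", "bad"]
model_3 = ["bad", "good", "good"]
print(majority_vote([model_1, model_2, model_3]))  # ['good', 'good', 'good']
```

Each individual model errs on a different observation, yet the combined vote is right everywhere; this is the core intuition behind combining weak models.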


Boosting focuses on the observations that are the hardest to classify. The algorithm reduces bias by raising the weight of wrongly classified observations in an iterative process: when the first weak model is created, the next model tries to correct its weaknesses, and so on, so that more attention is given to correcting the prediction errors. After n models are formed, this ensemble method increases the accuracy of the overall model. (Rocca 2019)

Bagging decreases the variance of the model without increasing its bias. Bootstrapping is a resampling method that creates modified samples from the original sample by randomly choosing observations with replacement. The number of data points remains the same, but a bootstrap sample can contain the same data point repeatedly. (Subramanian et al. 2018) Data is typically not available in large enough quantities to study truly independent samples; bootstrapped samples are therefore considered a good alternative for representing such samples and can be treated as "almost-independent" samples. Because the created model is always affected by the dataset used, bagging tries to minimize this effect. The algorithm fits weak models to these bootstrapped samples, and their combination forms an averaged output, thereby lowering the variance of the model. (Rocca 2019) Bagging especially improves the volatile estimates obtained from high-dimensional and wide datasets (Biau & Scornet 2016).
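Bootstrap sampling as described above can be demonstrated with a short sketch. The fixed random seed is arbitrary; the point is that a sample drawn with replacement contains only about 63.2 % unique observations (1 − 1/e), leaving roughly a third of the data out:

```python
import random

def bootstrap_sample(data, rng):
    """Draw n observations with replacement from a dataset of size n."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(42)
data = list(range(1000))
sample = bootstrap_sample(data, rng)

# Same size as the original, but with repeated data points,
# so the share of unique observations is about 0.632.
unique_share = len(set(sample)) / len(data)
print(len(sample), round(unique_share, 3))
```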

Stacking improves predictions but is a less used method than boosting or bagging. It combines many different models that each learn one part of the problem, but not the whole problem. A meta-model is then built on top of these base models, and together they form an accurate prediction. A meta-model can, for example, combine a k-NN classifier and an SVM (Rocca 2019), where the most optimal combination is used. (Sarkar & Vijayalakshmi 2019)
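A minimal sketch of the stacking idea is shown below. The two base learners are simple hand-written threshold rules and the meta-model is a fixed combination rule; in a real application all three would be trained models such as the k-NN and SVM mentioned above:

```python
# Two hypothetical base learners, each capturing one part of the problem.
def base_a(x):
    return 1 if x[0] > 0.5 else 0  # e.g. a rule on the first feature

def base_b(x):
    return 1 if x[1] > 0.5 else 0  # e.g. a rule on the second feature

def meta_model(meta_features):
    """The meta-model combines the base models' outputs; here a simple
    hand-written rule stands in for a trained classifier."""
    return 1 if sum(meta_features) >= 1 else 0

def stacked_predict(x):
    # Base predictions become the input features of the meta-model.
    return meta_model([base_a(x), base_b(x)])

print(stacked_predict([0.9, 0.1]))  # 1: base_a alone detects the class
print(stacked_predict([0.1, 0.2]))  # 0: neither base learner fires
```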

2.6. Random Forest

The Random Forest (RF) algorithm is a specific type of bootstrap aggregating (bagging) ensemble technique. As noted above, ensemble techniques cannot be used alone; they are combined with a machine learning model. An RF is formed from many individual decision trees (DT) that together create a forest. To understand RF and its randomness, the concept of a DT needs to be internalized first.

RF was first introduced by Breiman (2001). The method is an improved version of the bagging technique, modified to be more suitable for combining with the DT method. The trees are bootstrap aggregated: each DT is built by repeatedly resampling the training data, and the trees then vote for a consensus prediction. RF is a technique that can perform classification and regression analysis, as well as their combinations. (Jeyaraman et al. 2019)


Every tree in an RF is built with a different combination of training data, since each DT is constructed using a different bootstrap sample. The training data for each tree is drawn from the data by uniform sampling, with (or without) replacement. During the construction of each tree, each node split involves the random selection of a subset of 𝑘 variables. (Dimitriadis et al. 2017)
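The random selection of 𝑘 candidate variables at each node split can be sketched as follows. The feature names are hypothetical examples inspired by the thesis setting, not the actual study variables:

```python
import random

def candidate_features(all_features, k, rng):
    """At each node split, RF considers only a random subset of k variables
    instead of all of them, which decorrelates the individual trees."""
    return rng.sample(all_features, k)

rng = random.Random(0)
features = ["age", "sex", "education", "household_size",
            "vegetable_purchases", "sweet_purchases"]
chosen = candidate_features(features, k=2, rng=rng)
print(chosen)  # two randomly drawn distinct feature names
```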

RF uses approximately 2/3 of the original data to build each tree. The observations that were not used in the training process of a specific tree are used to evaluate the accuracy of the constructed model. These observations, approximately 1/3 of the data, form the test data for each tree. This test data is usually referred to as the out-of-bag (OOB) observations, and the performance of the model is evaluated with them. After the construction of each tree, the predictions for each sample are summarized, and the final class prediction for each sample is chosen by majority voting. (Subramanian et al. 2018) Each tree in the random forest votes for the most popular class, and the class that gets the most votes across the forest wins (Breiman 2001). RF thus combines the results of the decision trees and forms a prediction of the most suitable class. (Jeyaraman et al. 2019)
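The out-of-bag mechanism can be illustrated with a short sketch: the observations not drawn into a tree's bootstrap sample form that tree's test set. The seed and dataset size below are arbitrary:

```python
import random

def oob_indices(n, bootstrap_idx):
    """Observations never drawn into a tree's bootstrap sample
    are out-of-bag and can be used as test data for that tree."""
    return set(range(n)) - set(bootstrap_idx)

rng = random.Random(1)
n = 100
bootstrap_idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
oob = oob_indices(n, bootstrap_idx)

# Roughly a third of the data (about 1/e of n) is left out-of-bag.
print(len(oob))
```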

The most important variables affecting the performed random forest classification can be ranked by their importance. Certain variables may appear as the split criterion more than once, while some variables may never appear in the tree splits. There are two common importance measures for RF: Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), the latter also referred to as Gini Importance or Mean Decrease in Impurity (MDI). (Menze et al. 2009)

MDA shows how much the accuracy of the RF changes if a single variable is taken away. If the variable is not important, removing it should not decrease the accuracy of the model. MDA uses the OOB error estimate in the measurement process. To measure the importance of the i-th variable, the values of variable X(i) are randomly permuted in the out-of-bag observations and passed down the tree. MDA is the averaged difference between the out-of-bag error estimate before and after the permutation, over all trees. (Biau & Scornet 2016)

MDG shows the order of importance of the variables as measured by the Gini impurity index. The method was introduced by Breiman (2001). MDG indicates how often a specific variable was selected to perform a split, and how large its overall discriminative value was for the classification problem under study. (Menze et al. 2009) Gini impurity is used for measuring the quality of a node split during the construction of a decision tree. Here quality means how well the split separates the samples of different classes in the considered node. The method uses a small subset 𝑘 of strong variables and chooses the one of them that splits the node into two classes most effectively. In every node subset split
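The Gini impurity measure used in MDG can be computed with a few lines of Python. The class labels are hypothetical self-rated health categories used only for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the probability that two randomly drawn samples
    from the node belong to different classes (0 = perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_quality(left, right):
    """Weighted impurity of a node split; lower means the split
    separates the classes more cleanly."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["good", "good", "good"]))            # 0.0: a pure node
print(gini(["good", "bad", "good", "bad"]))      # 0.5: maximally mixed
print(split_quality(["good", "good"], ["bad", "bad"]))  # 0.0: ideal split
```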
