
3.4 Implementation

3.4.1 Data processing

Ground truth

The ground truth used for training a model changes over time, since the nurses keep finding new ways to describe and identify a seizure. New terminology can be added to these seizures, or there are events that can confidently be identified as negative samples but are commonly confused as positive. For these scenarios it is necessary to have a scalable training system that reads from different locations according to the labels of those locations, instead of reading the descriptors or types directly from the files. The reason is that some old data uses the classification method by type and represents important seizures, but these events get lost when the training is set up to label by descriptors, since they do not have a descriptor.

In order to make the training more scalable, the training of a new model needs to be changed to read the ground truth from specific locations and to label the samples according to the provided locations. This means that all annotations of positive samples should be saved in the same location, so that they can all be read into the same file and labeled as positive, regardless of the annotation format they contain. The three main labels that need to be considered when performing a training are

• Positive samples.

• Negative samples.

• Irrelevant samples.

This means it is necessary to provide the location of these three different sample sets. A location can be a bucket in S3 or a folder on a local computer. Each directory should contain JSON files corresponding to the patients that are going to be used to train a new binary classifier, and each of these JSON files should contain at least the values

• Begin: the timestamp of when the event starts.

• End: the timestamp for when the event ends.

Since the files are grouped by location, it is not necessary for them to contain extra information: the label is assigned according to their location.

• Positive samples labeled as 1.

• Negative samples labeled as -1.

• Irrelevant samples labeled as 0.
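The location-based labeling above can be sketched as a small lookup. This is a hypothetical illustration: the directory names and function name are assumptions, not the thesis implementation.

```python
import os

# Hypothetical sketch: the label is derived purely from the directory an
# annotation file lives in, so the JSON content itself never needs to
# carry a type or descriptor field. Directory names are assumed.
LOCATION_LABELS = {
    "positive": 1,     # positive samples
    "negative": -1,    # negative samples
    "irrelevant": 0,   # irrelevant samples, excluded from training
}

def label_for_location(path):
    """Return the label implied by the parent directory of the file."""
    folder = os.path.basename(os.path.dirname(path))
    return LOCATION_LABELS[folder]

print(label_for_location("/data/annotations/positive/patient_01.json"))  # 1
```

The same lookup works whether the paths point to local folders or to keys inside an S3 bucket, since only the parent-directory name is inspected.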

Normally, a classification problem does not only contain negative and positive samples.

The irrelevant samples correspond to fragments of the video that are in fact a seizure, but not the kind of seizure that is desired to be detected. Such samples are not required to be detected and are therefore excluded entirely from the training of the model.

Negative samples can be generated randomly from the signals explained in Section 2.3.2, as windows of 20 seconds that have a match with other signals. Since a huge number of negative samples can be generated, before fitting the data to the training of a classifier the negative samples are reduced according to the specified negative ratio threshold. Another way to select negative samples is after training a first model: the inference output from this model can be used to guide the selection of false-positive outputs. After trimming samples that are too short, those can be used as negative samples to train an even more accurate model. This approach improves the training dataset by adding negative samples that are easily confused with positive samples.
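The random window generation with a negative ratio cap could look roughly like the following sketch. The function and parameter names (`sample_negative_windows`, `negative_ratio`, `n_candidates`) are assumptions for illustration, not the actual implementation.

```python
import random

def sample_negative_windows(duration_s, positive_intervals,
                            negative_ratio, n_candidates=1000,
                            window_s=20, seed=0):
    """Draw random fixed-length windows that do not overlap any positive
    interval, then cap them at negative_ratio times the positive count.
    All names here are hypothetical, sketching the idea in the text."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(n_candidates):
        start = rng.uniform(0, duration_s - window_s)
        end = start + window_s
        # keep only windows disjoint from every annotated positive event
        if all(end <= b or start >= e for b, e in positive_intervals):
            negatives.append((start, end))
    # reduce according to the negative ratio threshold
    limit = int(negative_ratio * max(len(positive_intervals), 1))
    return negatives[:limit]

windows = sample_negative_windows(3600, [(100, 140), (900, 960)],
                                  negative_ratio=5)
print(len(windows))  # at most 5 * 2 = 10 windows
```

The hard-negative selection from false positives of a first model would then simply append those trimmed windows to the list returned here.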

The processing of the ground truth is performed in a single script that reads the JSON files and stores them in a single dataset with the values

• Patient id.

• Begin: timestamp of the beginning of the annotated seizure event.

• End: timestamp of the end of the annotated seizure event.

• Y: label indicating whether it is a positive, negative, or irrelevant sample.
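A minimal sketch of that single script, assuming a layout of one subfolder per label and one JSON file per patient containing a list of events with `begin` and `end` fields (the folder names and record shape are assumptions):

```python
import glob
import json
import os

# Assumed layout: <root>/positive/, <root>/negative/, <root>/irrelevant/,
# each holding one JSON file per patient with a list of {"begin", "end"}.
LABELS = {"positive": 1, "negative": -1, "irrelevant": 0}

def load_ground_truth(root):
    """Collect every annotation into one dataset of
    (patient_id, begin, end, y) records, with y taken from the location
    of the file rather than from its content."""
    rows = []
    for folder, y in LABELS.items():
        for path in sorted(glob.glob(os.path.join(root, folder, "*.json"))):
            patient_id = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                events = json.load(f)
            for event in events:
                rows.append({"patient_id": patient_id,
                             "begin": event["begin"],
                             "end": event["end"],
                             "y": y})
    return rows
```

Because the label comes from the folder, old type-based annotations and newer descriptor-based ones are read identically, which is the scalability property described above.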

Signals

In order to train a binary classifier, some predefined signals and events require transformation and processing before they can be fed to the training of the classifier, as explained in Section 3.1. However, this group of data can change: new signals and events extracted from the recorded video of the patients can be added to the group of predefined signals and events, or can replace them. To keep the data processing algorithm fast and adaptable to new types of data, the algorithm must be concise and focus only on the target group of data. This is enabled by designing the algorithm to download and transform only the data necessary to train the classifier, all in a single step, avoiding the intermediate steps of downloading all the data, then filtering and cleaning it, before processing it into the correct format to obtain the features.

Since the tools to analyze events from a video keep advancing, and the number of new signals that can be correlated to a biomarker keeps increasing, it is necessary to present an algorithm that adapts to any kind of data and does not always expect the same kind of data. More specifically, the dimension of the data will not always be 1, 12, or 25 as referred to in Section 2.3.2; instead, the dimension depends on each file, and the algorithm should infer it from the file itself. The previous implementation expected to receive the dimension of the data; for example, in the case of oscillations, which have 12 different values from the 12 different thresholds, the data saved in the files has dimension 12. These files of n dimensions need to be split into single time series in order to match the format of the other signals and to generate from them the features specified in Section 2.3.4.

In this new implementation, the dimension of a file is inferred once the file is read, so the program does not request extra information from the user or rely on values predefined in the code. This makes the code more automated and able to read new kinds of signals that may be generated by the pipeline in the future.
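The inference-and-split step can be sketched as follows, assuming the signal file loads into an array whose second axis is the signal dimension (the function name is hypothetical):

```python
import numpy as np

def split_signal(values):
    """Split an n-dimensional signal (e.g. the 12 oscillation thresholds)
    into single time series; a 1-dimensional signal is returned as-is.
    The dimension is inferred from the data shape, not supplied by the
    user or hard-coded. Hypothetical sketch of the idea in the text."""
    arr = np.asarray(values)
    if arr.ndim == 1:
        return [arr]            # already a single time series
    n_dims = arr.shape[1]       # inferred from the file content
    return [arr[:, i] for i in range(n_dims)]
```

A file of dimension 12 thus yields 12 single time series ready for the feature extraction of Section 2.3.4, while a new signal of any other dimension is handled by the same code without changes.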