
4 EXPERIMENTAL RESULTS

4.2 Device grading

This section presents the data set, evaluation process and results of the grade classification evaluation. First, the process of assigning ground truth grades to the physical devices is presented. After that, aspects of the evaluation process, such as the evaluation method and the hyperparameters used, are described. The last part of this section presents the grading results of the support vector machine and random forest.

4.2.1 Dataset

The data set used for evaluating the grading performance consisted of 37 smartphones. All the devices were black iPhone 7 models. The devices were not used for gathering the defect classification dataset described in section 4.1.1. All the devices were analysed with the SCORE tester. The defect information was extracted with the detection algorithm described in sections 3.3.1 and 3.3.2. The ground truth grades for the devices were assigned by human operators. Five operators inspected the physical devices. Based on the grade descriptions presented in table 3.4, the operators were asked to assign each device to the most appropriate grade class. The final grade for each device was the median of the grades given in the manual inspection. The distribution of the ground truth grades is presented in figure 4.1.
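The median rule described above can be illustrated with a minimal sketch. The integer grade scale and the function name `final_grade` are assumptions made for illustration only; the actual grade classes are those of table 3.4.

```python
from statistics import median_low


def final_grade(operator_grades):
    """Return the median of the grades given by the operators.

    median_low always returns one of the submitted grades, so the result
    is a valid grade class even if the number of operators were even.
    With five operators it equals the ordinary median.
    """
    if not operator_grades:
        raise ValueError("at least one operator grade is required")
    return median_low(operator_grades)


# Hypothetical example: five operators grade one device on an assumed
# integer scale. Sorted, the grades are [2, 2, 2, 3, 4].
grades_for_device = [2, 3, 2, 4, 2]
print(final_grade(grades_for_device))  # prints 2
```

Using the median rather than the mean keeps the final grade inside the set of grade classes and makes it insensitive to a single outlying operator.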

Figure 4.1. Distribution of ground truth grades of the devices used in grading evaluation.

4.2.2 Feature sets

Input features used in the evaluation were extracted with the defect detection algorithm described in sections 3.3.1 and 3.3.2. Raw defect information was used to construct statistical features from the detections. The statistical features calculated from the detections are presented in table 3.3. Different combinations of defect features were used in the evaluation to find the optimal set on which to base the grading. In addition, a feature set calculated from the detection data without classifying the defects was evaluated. The different feature sets are presented in table 4.4.

Table 4.4. Feature combinations used in grade classification evaluation.

Feature set Defects considered Number of features

1 Dust 8

Generally accepted tuning strategies for random forest are hard to find, and the benefit of tuning a random forest is not certain [70]. Slight improvements in performance might be achieved with hyperparameter tuning, but good results can generally be achieved by just setting the number of trees in the forest high enough. Empirical studies have shown that the best results are generally achieved with up to 100 trees [71]. The number of trees used in this study was set to 100, which was the default value in the scikit-learn implementation.

Unlike the random forest, the performance of an SVM model is highly dependent on the hyperparameter values used, as discussed in section 2.2.5. For the grading evaluation, the RBF kernel function was used, as it has proven to be a well-performing choice in different learning scenarios [72][73]. The ranges of hyperparameter values for γ and C to be used in an iterative search were selected based on the scikit-learn documentation [29] and related studies [74]. The iterative search is a valid method for optimizing an SVM, since only two parameter values are optimized [75]. The values of the two hyperparameters used in the search are presented in table 4.5.

Table 4.5. Hyperparameters used for optimizing SVM.

Parameter Values

γ 10^−3, 10^−2, ..., 10^2, 10^3

C 10^−3, 10^−2, ..., 10^2, 10^3

4.2.4 Results

The metrics used in the grading performance evaluation were accuracy (equation 2.13), precision (equation 2.14) and recall (equation 2.15). The metrics were calculated after each outer fold of the repeated nested cross-validation presented in section 2.2.7. The final values for the metrics were calculated by averaging the results over the folds. The standard deviation of the results achieved in the different folds was also calculated. The number of splits used in the cross-validation was set to 5 and the number of repetitions to 10.
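The evaluation protocol described above can be sketched with scikit-learn as follows. This is a minimal illustration and not the implementation used in the study: the synthetic data, the stratified folds, the three inner folds, accuracy as the inner selection metric, macro averaging of precision and recall, and all helper names are assumptions. Only the RBF kernel, the γ and C grid of table 4.5, the 100 trees, and the 5 splits with 10 repetitions come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    StratifiedKFold,
    cross_validate,
)
from sklearn.svm import SVC

# Grid from table 4.5: gamma and C in 10^-3, 10^-2, ..., 10^3.
PARAM_GRID = {
    "gamma": [10.0 ** k for k in range(-3, 4)],
    "C": [10.0 ** k for k in range(-3, 4)],
}

# Assumption: precision and recall are macro-averaged over grade classes.
SCORING = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, average="macro", zero_division=0),
    "recall": make_scorer(recall_score, average="macro", zero_division=0),
}


def synthetic_devices(seed=0):
    """Stand-in data: 37 'devices', 8 features, 3 grade classes (all assumed)."""
    return make_classification(
        n_samples=37, n_features=8, n_informative=4, n_classes=3,
        n_clusters_per_class=1, flip_y=0.0, random_state=seed,
    )


def tuned_svm(inner_splits=3, seed=0):
    """RBF SVM whose gamma and C are chosen in the inner cross-validation loop."""
    inner_cv = StratifiedKFold(n_splits=inner_splits, shuffle=True, random_state=seed)
    return GridSearchCV(SVC(kernel="rbf"), PARAM_GRID, cv=inner_cv, scoring="accuracy")


def random_forest(seed=0):
    """Random forest with 100 trees and no hyperparameter tuning."""
    return RandomForestClassifier(n_estimators=100, random_state=seed)


def repeated_outer_cv(model, X, y, n_splits=5, n_repeats=10, seed=0):
    """Score the model on every outer fold; return (mean, std) per metric."""
    outer_cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    result = cross_validate(model, X, y, cv=outer_cv, scoring=SCORING)
    return {
        name: (float(result[f"test_{name}"].mean()), float(result[f"test_{name}"].std()))
        for name in SCORING
    }
```

Calling `repeated_outer_cv(tuned_svm(), *synthetic_devices())` runs 50 outer folds and repeats the grid search inside each of them, so the hyperparameters are never selected on the data used for scoring. Passing `random_forest()` instead evaluates the second model under the same outer splits.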

Grading results achieved with SVM and random forest are presented in tables 4.6 and 4.7. The best grading performance was achieved with the SVM model using feature set number seven, presented in table 4.4. Feature set number seven consisted of features calculated from all of the classified defects.

Table 4.6. Grading results achieved with SVM.

Feature set Accuracy Precision Recall

Table 4.7. Grading results achieved with random forest.

Feature set Accuracy Precision Recall

5 DISCUSSION

The two research questions studied in this thesis were

1. How could defect features be extracted from the images taken with SCORE?

2. What is the grading performance of the developed grading implementation?

A solution to the first research question was sought with a mixed method, combining literature review and experimental research. Based on previous work, many approaches and techniques were found that have been used for detecting defects in different surface materials. The techniques were examined in terms of their appropriateness to the grading task. One requirement for the grading task was that it should not be resolved with completely opaque reasoning. This requirement excluded most of the approaches found, and the choice of detection method was made between two approaches: a deep learning-based object detection network, or a hybrid method combining traditional computer vision and deep learning. The hybrid approach was selected, as it was less data intensive and the detection scenario was considered simple enough for traditional computer vision techniques.

The second research question, and partly the first, was studied with an experimental approach. In total, 57 second-hand smartphones were imaged with the SCORE tester. Image data of 20 devices were used for developing the defect detection algorithm, and 37 were used for evaluating the whole grading process. The grading performance evaluation was conducted by analysing the 37 devices with the implemented defect detection algorithm and using the extracted features to predict the grades. Prediction was done with two classifier models, support vector machine and random forest, and the results were compared. These learning models were selected since they have commonly achieved high performance in related scenarios. Ground truth grades for the devices were assigned by comparing the gradings done by human operators.

Based on the results, the support vector machine achieved on average the best performance in correctly grading the devices. The feature set that yielded the best results consisted of features calculated from all the detected defect classes. Using this feature set, the support vector machine achieved 95.5% in accuracy, 96.1% in precision, and 96.0% in recall.

The lowest grading performance was also obtained with the support vector machine, when using features calculated from grease defects only. In that case, the achieved accuracy was 71.7%, precision 75.1%, and recall 71.5%.

The highest average performance with the random forest model was achieved with the feature set constructed from scratch and dust defects. Compared to the best results achieved with the support vector machine, it was 0.7% lower in accuracy, 0.7% lower in precision and 1.1% lower in recall. The deviation in the results was of similar size for both models. Based on these findings, both models showed similar variability in the results, but on average the support vector machine achieved better performance.

The differences in grading results between feature sets were large, but not as large as expected. The expectation was that features calculated from only grease or dust defects would not perform much better than a random guess, because these defects should not relate to the actual grade of the device: the human operators performing the grading were not instructed to consider dust or grease defects. Hence, there should be no correlation between dust and grease detections and the actual grades. However, even when using only dust defects, the accuracy, precision, and recall were well over 80% for both of the examined classifier models.

The explanation for this is that the proposed detection system is not accurate enough in detecting any specific defect class. Devices that have a lot of defects will also have a lot of detections. Some of the actual scratches will be misclassified as grease and dust, as shown in the confusion matrix 4.3. Therefore, a device with a lot of defects will in general have more dust and grease detections as well, and that alone can give enough information to do the grading. This can be observed from the grading results of random forest when using features calculated from unclassified defects. The classifier exceeded 92 % in accuracy, precision and recall on average, without considering different defect classes. This indicates that relatively high grading performance can be achieved without even classifying the defects. However, feature sets utilizing defect class information, and more specifically information calculated from scratch defects, outperformed both the feature sets that did not consider scratches and those that used features from unclassified data. Therefore, the results indicate that classifying the defects improved the grading performance.

The results show that the proposed system was able to extract meaningful defect information from the images on which to base the grading. The grading performance of the system was high, as it achieved over 95 % on all of the metrics used. The results indicate that there is potential in hybrid approaches that utilize traditional computer vision and deep learning, and that they can be effectively utilized in certain scenarios. The effectiveness of the selected approach is related to the simplicity of the detection task. The key characteristic of a defect in this scenario was non-uniformity in a plain background.

Based on this study, no conclusion can be made about whether the implemented solution is the optimal one for the automatic grading task. The number of devices used in the evaluation was small, and they were all similar to each other. Grading results for different smartphone models might differ a lot if the defect detection needs to be performed in a different type of environment. For example, some smartphone models might have a pattern printed on the front frame, or the pixel pattern of the display might be visible in raw images. In these cases, the characteristics of the device need to be considered during the detection to prevent misdetections.

The main impact of this thesis is that it can help the future development of the SCORE product. Alternative approaches for defect detection were presented, and their appropriateness to the task was evaluated. The alternative approaches should not be completely rejected, even though they were not selected for the implementation presented in this study. Some of them might be useful for future defect detection needs as the product evolves and more feedback is received from the customer companies.