Classifier confusion is a commonly used metric for analyzing which predictions are frequently mistaken for certain other classes. The confusion matrix makes it possible to visualize these patterns effectively. However, it is not only the frequency with which one class is confused with another that is of interest; the average confidence of these wrong predictions also plays a vital role in assessing their trustworthiness. This is visualized in the mean confusion confidence matrix. Together, the two matrices help to uncover systematic errors that need to be addressed in order to obtain a reliable model.
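To illustrate how both matrices can be derived from raw model outputs, the sketch below computes a confusion-rate matrix and a mean confusion confidence matrix with NumPy. The array names (`y_true`, `y_pred`, `conf`) and the NumPy-based approach are illustrative assumptions, not the tooling behind this report.

```python
# Minimal sketch (not the report's implementation): per-class confusion-rate
# matrix and mean confusion confidence matrix from predictions and confidences.
import numpy as np

def confusion_and_confidence(y_true, y_pred, conf, n_classes):
    rate = np.zeros((n_classes, n_classes))       # row: ground truth, column: prediction
    mean_conf = np.zeros((n_classes, n_classes))  # mean confidence of those confusions
    for t in range(n_classes):
        mask_t = y_true == t
        n_t = mask_t.sum()
        for p in range(n_classes):
            if t == p:
                continue  # only off-diagonal entries (actual confusions) are of interest
            confused = mask_t & (y_pred == p)
            if n_t > 0:
                rate[t, p] = confused.sum() / n_t        # confusion rate in [0, 1]
            if confused.any():
                mean_conf[t, p] = conf[confused].mean()  # average confidence of these wrong predictions
    return rate, mean_conf
```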
The most commonly confused class is cat, which is wrongly classified as dog at a rate of 16.8% with an average confidence of 68%.
Ground Truth Class | Prediction | Confusion Rate (%) | Average Confidence (%) |
---|---|---|---|
cat | dog | 16.8 | 68 |
dog | cat | 9.8 | 65 |
bird | deer | 8.5 | 67 |
automobile | truck | 6.4 | 78 |
bird | airplane | 5.9 | 71 |
airplane | ship | 5.6 | 79 |
bird | dog | 5.6 | 67 |
horse | deer | 5.6 | 69 |
horse | dog | 5.3 | 59 |
cat | deer | 5.1 | 64 |
Shown below are the confusion matrix and the mean confusion confidence matrix. The confusion matrix shows, for each ground truth class (y-axis), how often it is incorrectly predicted as another class (x-axis). The mean confusion confidence matrix shows, for each ground truth class, the average confidence with which it has been confused with the respective other class. Please note that if the number of classes is too high, only the most critical classes are displayed in each matrix.
Regarding confusion, the worst class is cat, for which 16.8% of all samples are incorrectly classified as dog. It is followed by class dog (9.8% classified as class cat) and class bird (8.5% classified as class deer). Of all incorrect predictions for class cat, 49.1% are predicted as class dog. This indicates a significant bias, possibly caused by systematic issues in the data that should be addressed, e.g., semantically highly similar classes.
Additionally, it is important to consider how confidently pairs of classes are confused. Here, the worst class is dog, which, when confused with class ship, has an average confidence of 90.7%. It is followed by class automobile (mean confidence of 87.3% when mistaken for class bird) and class airplane (mean confidence of 84.6% when mistaken for class automobile). Especially for class dog, the mean confusion confidence of 90.7% is very high, and it is strongly advised to address this issue.
However, this interpretation must be seen in relation to the overall error rate. If that rate is sufficiently low for the considered use case, the confusion might not be as critical.
Prediction stability measures how much it takes to change the outcome of a model, i.e., the predicted class, by comparing the confidence difference between the top two classes. Generally, the higher the value, the better - in combination with an overall good performance - as it demonstrates the robustness of the model. Specifically, the prediction stability should be high for all samples that have been classified correctly and low for all samples that have been classified incorrectly. This would indicate that the model is robust and certain when it is correct, and unstable and uncertain when it is incorrect.
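As a minimal sketch of how prediction stability can be computed, the snippet below derives the gap between the two highest class confidences and aggregates it separately for correct and incorrect predictions; it also reports the share of "very stable" predictions with a gap above 0.8, as discussed further below. The array names `probs` and `y_true` are illustrative assumptions.

```python
# Minimal sketch (assumed NumPy-based): prediction stability as the gap between
# the two highest class confidences per sample.
import numpy as np

def prediction_stability(probs, y_true):
    top2 = np.sort(probs, axis=1)[:, -2:]      # two highest confidences per sample (ascending)
    gap = top2[:, 1] - top2[:, 0]              # stability: top-1 minus top-2 confidence
    correct = probs.argmax(axis=1) == y_true
    return {
        "stability_correct": gap[correct].mean(),
        "stability_incorrect": gap[~correct].mean(),
        # share of very stable predictions (gap > 0.8), split by correctness
        "very_stable_correct": (gap[correct] > 0.8).mean(),
        "very_stable_incorrect": (gap[~correct] > 0.8).mean(),
    }
```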
For all correctly predicted samples, the prediction stability is 0.85; for all incorrectly predicted ones, it is 0.44.
Class | Stability Correctly Predicted | Stability Incorrectly Predicted |
---|---|---|
airplane | 0.83 | 0.53 |
automobile | 0.92 | 0.53 |
bird | 0.76 | 0.45 |
cat | 0.66 | 0.43 |
deer | 0.84 | 0.4 |
dog | 0.81 | 0.42 |
frog | 0.89 | 0.41 |
horse | 0.84 | 0.4 |
ship | 0.95 | 0.46 |
truck | 0.93 | 0.44 |
Additionally, the diagrams below show the distribution of confidence differences between the top two classes for each sample. That means, for each sample, the difference in confidence between the predicted class and the second most likely class is taken. Ideally, correct predictions should exhibit high confidence differences, as this is a sign that the model is certain about its decision and small changes won't affect it severely. Incorrect predictions, however, should show small confidence differences, as this would express a high uncertainty of the model in its wrong prediction.
The prediction stability for all correct predictions is very good at 0.85 and shows the robustness of these predictions. The prediction stability for all incorrect predictions is high at 0.44. The model is therefore confident in its incorrect predictions, and it is strongly advised to address this issue. Regarding the stability of all correctly predicted samples, there is a significant difference between the best performing class, ship (0.95), and the worst performing class, cat (0.66). Class ship is therefore much more stable and likely to be predicted more robustly than class cat. Looking at the stability of all incorrectly predicted samples, the best performing class, horse (0.40), shows noticeably more unstable predictions than the worst performing class, automobile (0.53). Class horse therefore has a greater potential to have its error rate reduced than class automobile.
77% of the correct predictions are very stable with a difference in confidence to the next most probable class of more than 80%. This is good, as it shows a high level of robustness, but could also be a sign of overconfidence of the model. 17% of the incorrect predictions are very stable with a confidence difference to the next most probable class of more than 80%. This should be addressed as it shows a significant amount of highly confident but incorrect predictions made by the model.
Calibration measures the reliability of the confidences of an ML model's predictions. Well-calibrated models allow the prediction confidence to be used reliably for further decision-making. For instance, if the confidence in a prediction is low, additional safety measures can be taken, e.g., switching to a safety mode or consulting a human expert. If the confidences are not well-calibrated, these safety measures are either triggered too often, resulting in a loss of overall system performance, or too seldom, potentially leading to critical failures.
The expected calibration error is 0.04 while the maximum calibration error is -0.16.
Class | ECE |
---|---|
airplane | 0.05 |
automobile | 0.04 |
bird | 0.1 |
cat | 0.11 |
deer | 0.02 |
dog | 0.03 |
frog | 0.02 |
horse | 0.02 |
ship | 0.02 |
truck | 0.01 |
The diagram below shows the mean confidences for the respective confidence intervals plotted against the accuracy. The line should follow the diagonal as closely as possible in order for the model to be well calibrated across all confidence intervals. If the line lies below the diagonal at a point, the model is overconfident for that confidence range, i.e., it reported higher confidences than what was empirically determined. When trusting overconfident predictions, the system makes more errors than expected. On the other hand, if the line lies above the diagonal at a point, the model is underconfident for that confidence range. Relying on these predictions leads to a more cautious system, i.e., the system rejects more samples than necessary.
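The sketch below shows one common way to compute a binned expected calibration error together with the points behind such a reliability diagram; the equal-width binning scheme and the array names `probs`/`y_true` are assumptions for illustration, not necessarily the procedure used for this report.

```python
# Minimal sketch: binned expected calibration error (ECE) and reliability-diagram points.
import numpy as np

def expected_calibration_error(probs, y_true, n_bins=10):
    conf = probs.max(axis=1)                        # confidence of the predicted class
    correct = probs.argmax(axis=1) == y_true
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, diagram = 0.0, []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()                # empirical accuracy in this bin
        avg_conf = conf[in_bin].mean()              # mean reported confidence in this bin
        ece += in_bin.mean() * abs(acc - avg_conf)  # weighted by the fraction of samples in the bin
        diagram.append((avg_conf, acc))             # one point of the reliability diagram
    return ece, diagram
```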
The model's calibration is acceptable. Its confidences should be used with caution for further decision-making and only if the data expected during the application is similar to the test data. Regarding the expected calibration error for the individual classes, there is a significant difference. For instance, the best class, truck, has an ECE of 0.01 while the worst class, cat, has an ECE of 0.11. Depending on the criticality of the worst performing class, this is a problem that needs to be addressed.
Uncertainty estimation enables the overall system to trust the ML model only if it is confident in its predictions. When to trust the ML model can be adjusted by setting a threshold on the confidence, e.g., the predictions of the model are only considered above a confidence threshold of 80%. However, when discarding predictions based on the confidence, there is always a trade-off between the remaining error and accuracy that needs to be considered.
On the provided dataset, the minimum obtainable error for this model is 0.0%, which is achieved when rejecting all samples with a confidence of less than 100.0%. In that case, 12.6% of all samples are still classified correctly.
The ratio of the number of certain but incorrect samples to all samples is called the Remaining Error Rate (RER). To minimize the overall risk, it needs to be as low as possible. Nonetheless, if a model always output low confidences, the system would constantly remain in fallback mode and would be unable to provide the intended functionality. Therefore, the ratio of the number of certain and correct samples to all samples - called the Remaining Accuracy Rate (RAR) - needs to be as high as possible in order to stay in performance mode most of the time. The diagrams below show the trade-off between both metrics for all possible thresholds on the uncertainty. Ideally, the curve is as close to the upper left corner as possible, demonstrating high performance and low error.
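A minimal sketch of how the RER/RAR trade-off can be traced over all confidence thresholds, following the definitions above (certain-but-incorrect vs. certain-and-correct, both relative to all samples); the threshold grid and array names are illustrative assumptions.

```python
# Minimal sketch: remaining error rate (RER) and remaining accuracy rate (RAR)
# over a grid of confidence thresholds.
import numpy as np

def rer_rar_curve(probs, y_true, thresholds=np.linspace(0.0, 1.0, 101)):
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == y_true
    n = len(y_true)
    curve = []
    for t in thresholds:
        certain = conf >= t                    # samples the system would accept at this threshold
        rer = (certain & ~correct).sum() / n   # certain but incorrect, relative to all samples
        rar = (certain & correct).sum() / n    # certain and correct, relative to all samples
        curve.append((t, rer, rar))
    return curve
```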
The default accuracy of the model is 83.32% and the error is 16.68%. Using uncertainty, inputs to the model can be rejected if they are below a given threshold, decreasing the error but also the performance. A good uncertainty estimation method can reduce the error without sacrificing too much performance. The provided model can decrease its error from 16.68% to 10.19% by discarding inputs with a confidence of less than 58.50%, sacrificing 5.0 percentage points of remaining accuracy. Similarly, the error can be reduced by 49.94%, from 16.68% to 8.33%, when using a certainty threshold of 65.01%, while still being able to correctly process 75.91% of all samples. The error of class ship can be reduced the most, down to 0.00%, while still correctly processing 64.60% of all samples of this class. For this error rate of 0.00%, class airplane performs the worst and is only able to correctly process 0.90% of all samples of this class. Considering the accuracy across all thresholds, class truck performs best with an area under the RAR curve of 0.38, compared to class cat, which has an AUC of 0.54. This is a moderate difference that, depending on the criticality of the individual classes, should be addressed. Considering the error across all thresholds, class cat performs best with an area under the RER curve of 0.01, compared to class ship, which has an AUC of 0.12. This is a very large difference that, depending on the criticality of the individual classes, needs to be addressed.
As with the previously discussed remaining error and accuracy, it is also interesting to look at the ratio of uncertain predictions for all possible confidence thresholds. It reveals how sensitive the model is towards the confidence threshold. Ideally, there is a clear distinction between the uncertainty ratios for correctly and incorrectly predicted samples over a broad range of thresholds. This would indicate a low sensitivity towards the confidence threshold making the model more robust, as a precise choice of the threshold is not required.
To visualize the uncertainty estimation quality of a method, we plot the ratio of uncertain correct/incorrect samples to all correct/incorrect samples, respectively, highlighting the ability to shift predictions from certain to uncertain depending on the defined threshold. The figure below shows the two ratios for each method. High uncertainty ratios indicate a more cautious model that rejects more samples as uncertain, while low uncertainty ratios indicate a more confident model. For the incorrect samples, the uncertainty ratio should be as high as possible, while the uncertainty ratio should be low for the correct predictions. The blue area between the two curves should be as large as possible, demonstrating the capability of the model to reject incorrect predictions with high uncertainty while still giving confident predictions for the correct samples.
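The following sketch computes the two uncertainty ratios over a grid of confidence thresholds, following the description above; the array names and the threshold grid are illustrative assumptions.

```python
# Minimal sketch: ratio of uncertain samples among correct and incorrect predictions,
# for each confidence threshold. A large gap between the two ratios over a broad
# range of thresholds indicates low sensitivity to the exact threshold choice.
import numpy as np

def uncertainty_ratios(probs, y_true, thresholds=np.linspace(0.0, 1.0, 101)):
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == y_true
    ratios = []
    for t in thresholds:
        uncertain = conf < t                           # rejected as uncertain at this threshold
        ratio_correct = uncertain[correct].mean()      # should stay low
        ratio_incorrect = uncertain[~correct].mean()   # should rise quickly
        ratios.append((t, ratio_correct, ratio_incorrect))
    return ratios
```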
On the given dataset, the model does not produce confidences below 22.6%. The reason for this can be that the employed uncertainty estimation approach is based on, e.g., the softmax function, where the smallest possible confidence depends on the number of classes. This behavior is not critical; however, it should be considered when selecting a confidence threshold. The biggest difference between the uncertainty ratios for correctly and incorrectly classified samples is at 90.9%. This means that, for this threshold, the trade-off between the remaining accuracy and remaining error is the best, not considering any performance or safety/dependability goals. 28% of all thresholds produce trade-offs that are only up to 10% worse. As a consequence, the system is slightly sensitive to changes in the confidence threshold and there is acceptable control over the trade-off between error and performance regarding safety or performance goals.
Similar to the remaining error and accuracy, the prediction rejection analysis reveals the trade-off between the remaining error and the fraction of samples rejected by the model based on a confidence threshold.
The prediction rejection ratio is 0.62.
On the given test data, an error rate of 0.0% can be achieved while still correctly processing 12.0% of all samples. The prediction rejection ratio summarizes the diagram and, with a value of 0.62, shows a moderate potential to minimize the errors made by the network by rejecting samples based on their confidence scores.
The receiver operating characteristic (ROC) curve demonstrates the performance of a binary classifier, in this case, the discriminative ability of the classifier to distinguish between in-distribution (ID) and out-of-distribution (OOD) samples. The Area Under the Receiver Operating Characteristic (AUROC) is often used as a threshold-independent metric to summarize the ROC curve.
The AUROC of the classifier is 0.84.
Shown below is the ROC curve, in which the fraction of true positives (TPR, the ratio of correctly detected OOD samples to all OOD samples) is plotted against the fraction of false positives (FPR, the ratio of in-distribution samples wrongly classified as OOD to all in-distribution samples) for all possible thresholds on the confidence scores. The resulting curve is used to compute the AUROC. A model with an AUROC of only 0.5 (black dashed line) has similar TPR and FPR. Ideally, a TPR of 1 and an FPR of 0 would be the best case, leading to an AUROC of 1. In practice, a steeper curve yields a larger AUROC, which indicates a better distinguishing ability of the classifier.
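As a sketch of how such an ROC curve and AUROC can be obtained for OOD detection, the snippet below treats OOD as the positive class and uses the negated prediction confidence as the OOD score; the use of scikit-learn and the array names are assumptions for illustration, not the report's tooling.

```python
# Minimal sketch: ROC curve and AUROC for ID-vs.-OOD discrimination.
# is_ood: 1 for out-of-distribution samples, 0 for in-distribution (hypothetical labels)
# conf:   the model's confidence in its predicted class for each sample
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def ood_roc(is_ood, conf):
    score = -np.asarray(conf)                       # low confidence should indicate OOD
    fpr, tpr, thresholds = roc_curve(is_ood, score) # one (FPR, TPR) point per threshold
    auroc = roc_auc_score(is_ood, score)
    return fpr, tpr, auroc
```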
The classifier is able to correctly distinguish most ID from OOD samples. However, there is still a noticeable portion of OOD samples that are not rejected.
What's next?
Want to learn more? This is what Fraunhofer IKS can offer on top.