How to measure the quality of explanations of AI predictions
Explainable Artificial Intelligence: From Black Boxes to Transparent Models
In the previous part of the series, we looked at the components and properties that good explanations should have. The first component, understandability, focuses on the extent to which explanations of the decisions and behavior of machine learning and artificial intelligence models are comprehensible to humans. Can a person grasp them cognitively?
The second component, fidelity, tells us how well the explanations describe the actual behavior of the model or the entire system.
Defining the properties of a good explanation is a crucial first step towards achieving them. We know the setup: at the input, there is a machine learning or artificial intelligence model (e.g. a neural network), an input sample (e.g. an image), the prediction that the model made for this sample and, finally, an explanation. At the output, we want to determine how understandable and faithful this explanation is. The question is: how can we measure the quality of explanations? (Figure 1)
To evaluate the quality of explanations, we distinguish two families of approaches: human-centered evaluation and functionality-grounded evaluation (Figure 2). In this article, we will use some specific examples to introduce human-centered evaluation. In the next article, we will focus on the functionality-grounded evaluation.
In human-centered evaluation, we assume that the recipient of the explanations is a human being and it is, therefore, necessary to involve people in the evaluation process. In the functionality-grounded evaluation, the goal is a quantitative and automated evaluation of the quality of the explanations.
Human-centered evaluation of the quality of explanations
One of the most important goals of research in the field of explainable artificial intelligence is to increase the transparency of non-transparent, complex models and their predictions. In this way, it is possible to increase the degree to which people accept solutions based on artificial intelligence in practice. It can also provide humans with tools to check on artificial intelligence (e.g., if the explanation of the prediction was obviously illogical, people should take this fact into account).
In human-centered evaluation of the quality of explanations, the main idea is simple: the people to whom the explanations are addressed evaluate their quality. We focus on two aspects: how explanations help people perform various tasks, and how people perceive the explanations. In their work, Zhou et al. refer to these as application-grounded evaluation and human-grounded evaluation.
In application-grounded evaluation, we are interested in how much, if at all, explanations help users accomplish various tasks. We usually compare metrics of performance, efficiency or comfort between two groups of users who perform the same task. Users in group A receive only the prediction, while users in group B also receive some form of explanation of the prediction.
In human-grounded evaluation, we focus on how the provided explanations were perceived by the users.
An example, quantitative and qualitative metrics
Let’s look at a simple example: a blogging platform. One of the tasks of the administrators of a blogging platform is to identify and possibly delete hateful posts that appear on the platform. If the platform receives tens or hundreds of thousands of posts daily, one administrator cannot check them all. A natural solution is to use a machine-learning-based hate speech detector that identifies potentially hateful posts. This significantly reduces the number of posts that the administrator has to check. However, there can still be thousands of them every day. The question is, how can explainable artificial intelligence help to make the administrators’ work even more efficient?
In Figure 3 we can see a comparison of two situations. In the first case, the administrator can see the posts that the artificial intelligence has identified as hateful. In the second case, there are highlighted parts of the post’s text that the artificial intelligence evaluated as hateful. Since the administrator cannot simply delete a post that did not violate the blog’s rules (there’s freedom of speech), the administrator must check the posts identified by artificial intelligence before deleting them.
In the first case, the administrator has to read each flagged text until he/she either comes across hate speech (confirmed hate speech), or reaches the end of the text and decides that it is a false alarm and the post is fine.
In the second case, when explanations are also available, the administrator can focus primarily on the highlighted parts of the texts, which can make the work significantly more efficient, especially if the post is actually hateful and not a false alarm. In that case, a single occurrence of, e.g., a racist or anti-semitic statement is sufficient: the administrator does not have to deal with the rest of the text and can hide the post, delete it, or ask the author for a correction. This significantly simplifies and speeds up the administrator’s work.
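To make the second situation more concrete, here is a minimal sketch of how highlighted text could be produced from per-token scores. The token scores, the threshold of 0.5 and the function name `highlight_tokens` are illustrative assumptions; in practice, the scores would come from whichever explainability method is attached to the detector.

```python
def highlight_tokens(tokens, scores, threshold=0.5):
    """Wrap tokens whose score meets the threshold in [[...]] markers.

    `scores` is a hypothetical per-token attribution (one value per token),
    not the output of any specific explainability method.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    marked = [
        f"[[{token}]]" if score >= threshold else token
        for token, score in zip(tokens, scores)
    ]
    return " ".join(marked)


# Made-up example; "SLUR" stands in for an offensive word and the
# scores are invented for illustration.
tokens = ["you", "are", "a", "SLUR"]
scores = [0.10, 0.05, 0.20, 0.90]
highlighted = highlight_tokens(tokens, scores)
print(highlighted)
```

The administrator would then read the marked span first and only fall back to the full text if the highlighted part does not settle the decision.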
Quantitative measurement of the quality of explanations depends on the task. We would measure the performance of users of a system designed to detect hateful posts differently from the performance of users of a system that counts cancer cells in histological images. In the example with the blogging platform, it would be appropriate to measure, for example:
- The difference in the number of posts (efficiency) that can be reviewed by administrators who do not have an explanation and those who do.
- The difference in accuracy achieved by administrators from the first and the second group.
- The difference in the ability of administrators from the two groups to detect incorrect predictions of the model.
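The three measures above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Review` record, the field names and the tiny made-up logs. We also assume that every reviewed post was flagged as hateful by the detector, so an incorrect prediction is a flagged post that is not actually hateful.

```python
from dataclasses import dataclass


@dataclass
class Review:
    is_hateful: bool          # ground-truth label of the flagged post
    admin_says_hateful: bool  # the administrator's decision
    seconds: float            # time spent reviewing the post


def summarize(reviews):
    """Compute throughput, accuracy and false-alarm detection for one group."""
    n = len(reviews)
    total_seconds = sum(r.seconds for r in reviews)
    correct = sum(r.admin_says_hateful == r.is_hateful for r in reviews)
    # Posts the detector flagged although they are not hateful.
    false_alarms = [r for r in reviews if not r.is_hateful]
    caught = sum(not r.admin_says_hateful for r in false_alarms)
    return {
        "posts_per_hour": 3600 * n / total_seconds,
        "accuracy": correct / n,
        "false_alarm_detection": (
            caught / len(false_alarms) if false_alarms else float("nan")
        ),
    }


# Made-up logs: group A sees only predictions, group B also sees explanations.
group_a = [Review(True, True, 60), Review(False, True, 90),
           Review(False, False, 90), Review(True, True, 60)]
group_b = [Review(True, True, 20), Review(False, False, 40),
           Review(False, False, 40), Review(True, True, 20)]

summary_a = summarize(group_a)
summary_b = summarize(group_b)
difference = {key: summary_b[key] - summary_a[key] for key in summary_a}
print(difference)
```

In a real study, the ground-truth labels would have to come from an independent annotation, and the logs would contain far more than four reviews per group.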
Objectivity is a great advantage of quantitative measurement. With a properly set-up experiment (balanced groups of users, equal conditions, etc.), we can say with a fairly high degree of certainty whether and how much the explanations helped the user.
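One possible way to quantify this certainty is a two-sided permutation test on the difference between the group means. The sketch below is only one option among many (the article does not prescribe a specific test), and the function name, the number of resamples and the posts-per-hour figures are made up for illustration.

```python
import random


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for the difference in group means.

    Returns the observed difference (mean of B minus mean of A) and an
    estimated p-value.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = sum(group_b) / n_b - sum(group_a) / n_a
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign users to the two groups.
        rng.shuffle(pooled)
        diff = sum(pooled[n_a:]) / n_b - sum(pooled[:n_a]) / n_a
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up posts reviewed per hour for each administrator.
without_explanations = [40, 45, 38, 50, 42, 47]
with_explanations = [70, 65, 80, 72, 68, 75]

observed, p_value = permutation_test(without_explanations, with_explanations)
print(f"difference in means: {observed:.1f}, p-value: {p_value:.4f}")
```

A small p-value suggests that the observed difference would be unlikely if the explanations had no effect, provided the groups were balanced and the conditions equal.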
In qualitative measurement, we focus more on the subjective assessment of the benefit and satisfaction of users with the explanations. Questionnaires are a common form of feedback collection. In these questionnaires, the users evaluate explanations from different points of view, e.g.:
- Usefulness – a basic view of whether users find the provided explanation useful.
- Confidence and trust – the extent to which users believe that the explanations they have received are correct. In the previous part, we said that explanations should be consistent, otherwise people will not trust them.
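Questionnaire answers of this kind are often collected on a rating scale and then summarized per dimension. The following sketch assumes a 1 to 5 scale and two dimensions named `usefulness` and `trust`; the scale, the names and the responses are made up for illustration.

```python
from statistics import mean, median


def summarize_ratings(responses, scale=(1, 5)):
    """Return the mean and median rating for each questionnaire dimension."""
    low, high = scale
    summary = {}
    for dimension in responses[0]:
        ratings = [response[dimension] for response in responses]
        if any(not low <= rating <= high for rating in ratings):
            raise ValueError(f"rating outside {low}-{high} for {dimension!r}")
        summary[dimension] = {"mean": mean(ratings), "median": median(ratings)}
    return summary


# Made-up answers from four users (1 = strongly disagree, 5 = strongly agree).
responses = [
    {"usefulness": 4, "trust": 3},
    {"usefulness": 5, "trust": 4},
    {"usefulness": 3, "trust": 2},
    {"usefulness": 4, "trust": 4},
]

rating_summary = summarize_ratings(responses)
print(rating_summary)
```

Reporting the median next to the mean is a cautious choice here, because ratings on such scales are ordinal and the mean can be misleading for skewed answers.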
Although qualitative measurement is largely subjective, it is an important part of evaluating explainability methods. It is the users (people) who are the recipients of the explanations, and therefore it is necessary to measure their subjective attitude towards them. For example, if an explanation helped the user achieve higher accuracy in completing the task but frustrated him/her at the same time (e.g., by being too extensive), it would not be optimal.
Advantages and disadvantages of human-centered evaluation of quality of explanations
The advantage of human-centered evaluation is that we can use it to directly measure the benefit of the explanations to the users. We can get convincing evidence of how satisfied people are with explanations and how much they help them in completing various tasks.
Thanks to this, we are able to provide the user with an explanation that is “tailor-made” and that sufficiently fulfills the characteristics that a good explanation should have (you can read about them in this part of the series).
In a paper presented in 2022 at the Workshop on Explainable Artificial Intelligence at the IJCAI conference, Tompkins et al. showed how important it is to validate the different forms and variations of explanations available to the user. For example, at first glance it makes sense that the more extensive and complete an explanation is, the more it should help users complete the task. In one of the experiments with 208 participants, the authors compared the results achieved by participants who were given different numbers of so-called “counterfactual” explanations. It turned out that a less extensive explanation (1 or 2 “counterfactuals”) was objectively more beneficial for users. In addition, the users themselves expressed a preference for receiving fewer explanations.
A potential disadvantage of human-centered evaluation is its subjectivity and the associated sensitivity of the results to the selection of people involved in the evaluation. Especially if we want to involve a large number of people in the experiment, it can be difficult to find enough users who are sufficiently diverse. For example, the number of radiologists with whom we would like to test the benefit of explanations in AI-supported diagnostic software is very small, and their time is scarce. Therefore, we often depend on evaluating with a smaller sample of end users or on testing with lay users (in which case we may not be able to perform application-grounded evaluation at all).
As we showed in the previous part of the series, a good explanation balances two components: understandability and fidelity. In human-centered evaluation, we are able to measure the understandability of the explanations directly (it is one of the questions that we can ask the user in the questionnaire), but we can only measure fidelity indirectly. We assume that the model is too complex for a person to fully understand its internal behavior and the way it reached the prediction. Users therefore cannot tell whether the explanation fully corresponds to the actual behavior of the model.
Another problem is the time and resource requirements of such an evaluation. Research in explainable artificial intelligence is constantly introducing new approaches and methods. At the same time, the explanation should be specifically tailored to the given task and model (and data). If we wanted to find out which of the many explainability methods and their combinations would help the user most in solving a given task, and which one the user would consider best, we would have to conduct a large number of human experiments for each task. This is not feasible in practice.
The answer to the problem of scaling, subjectivity and the absence of a direct measurement of the fidelity of explanations is the functionality-grounded evaluation of the quality of explanations. We will talk about this family of approaches in the next part of the series.
In this part of our series on explainable artificial intelligence, we looked at two basic families of methods to evaluate the quality of explanations. We took a closer look at human-centered quality evaluation.
Using a model example, in which administrators of a blogging platform were tasked with identifying hateful posts, we showed that we measure the quality of explanations in two ways – quantitatively and qualitatively.
In human-centered evaluation of the quality of explanations, we focus on two things:
- How does the provided explanation help the person in fulfilling the given task? Does it help at all?
- How does the person (subjectively) perceive the explanation?
In the first case, we typically quantitatively measure whether a group of users who have an explanation do better at a task than those who do not have the explanation. Alternatively, it is possible to compare groups of users who received different explanations.
In the second case, we most often collect feedback through questionnaires, in which we find out, for example, to what extent users found the explanations useful or understandable.
In the next part of the series, we will look at the functionality-grounded evaluation of the quality of explanations, which appropriately complements the human-centered evaluation and addresses some of its disadvantages.
The PricewaterhouseCoopers Endowment Fund at the Pontis Foundation supported this project.
ZHOU, Jianlong, et al. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 2021, 10.5: 593.
TOMPKINS, Rick, et al. The effect of diversity in counterfactual machine learning explanations. IJCAI 2022 Workshop on Explainable Artificial Intelligence (XAI), 2022. https://sites.google.com/view/xai2022