【3】Interpreting a BERT Model with the Captum Library
-
Captum is a model interpretability and understanding library for PyTorch. "Captum" means comprehension in Latin, and the library contains general-purpose implementations of Integrated Gradients, saliency maps, SmoothGrad, VarGrad and other attribution methods for PyTorch models. It integrates quickly with models built on domain-specific libraries such as torchvision and torchtext.
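As a quick, self-contained illustration of the Captum API (a toy model unrelated to this tutorial; the model and feature sizes are made up for the example), Integrated Gradients can be applied to any PyTorch module in a few lines:

import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Toy 3-class classifier with 4 input features (hypothetical, for illustration only).
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
toy_model.eval()

ig = IntegratedGradients(toy_model)
inputs = torch.rand(2, 4)                 # 2 samples, 4 features
baselines = torch.zeros_like(inputs)      # all-zero baseline

# Attribute the score of class 1 to each input feature.
attributions, delta = ig.attribute(inputs, baselines=baselines, target=1,
                                   return_convergence_delta=True)
print(attributions.shape)                 # torch.Size([2, 4]) -> one score per feature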
Source notebook: https://github.com/pytorch/captum/blob/master/tutorials/Bert_SQUAD_Interpret2.ipynb
Interpreting BertLayer Outputs and Self-Attention Matrices in each Layer
Now let's look into the layers of our network. More specifically, we would like to examine the distribution of attribution scores for each token across all layers, as well as the attribution matrices for each head in every layer of the BERT model.
We do that using one of the layer attribution algorithms, namely Layer Conductance. However, we encourage you to try out and compare the results with other algorithms as well.
Let's configure InterpretableEmbeddingBase again, in this case in order to interpret the layers of our model.
interpretable_embedding = configure_interpretable_embedding_layer(model, 'bert.embeddings.word_embeddings')
c:\users\yujun\appdata\local\programs\python\python37\lib\site-packages\captum\attr\_models\base.py:189: UserWarning: In order to make embedding layers more interpretable they will be replaced with an interpretable embedding layer which wraps the original embedding layer and takes word embedding vectors as inputs of the forward function. This allows us to generate baselines for word embeddings and compute attributions for each embedding dimension. The original embedding layer must be set back by calling `remove_interpretable_embedding_layer` function after model interpretation is finished. "In order to make embedding layers more interpretable they will "
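As the warning above points out, the original embedding layer has to be restored once interpretation is finished. A minimal cleanup sketch (the function is part of Captum's API; placing the call at the very end of the analysis is an assumption here):

from captum.attr import remove_interpretable_embedding_layer

# Restore the original word-embedding layer after all attributions have been computed.
remove_interpretable_embedding_layer(model, interpretable_embedding)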
Let's iterate over all layers and compute the attributions with respect to all tokens in the input, as well as the attention matrices.
Note: since the code below iterates over all layers, it can take over 5 seconds to run. Please be patient!
layer_attrs_start = []
layer_attrs_end = []

layer_attn_mat_start = []
layer_attn_mat_end = []

input_embeddings, ref_input_embeddings = construct_whole_bert_embeddings(
    input_ids, ref_input_ids,
    token_type_ids=token_type_ids, ref_token_type_ids=ref_token_type_ids,
    position_ids=position_ids, ref_position_ids=ref_position_ids)

for i in range(model.config.num_hidden_layers):
    lc = LayerConductance(squad_pos_forward_func, model.bert.encoder.layer[i])
    layer_attributions_start = lc.attribute(inputs=input_embeddings, baselines=ref_input_embeddings,
                                            additional_forward_args=(token_type_ids, position_ids, attention_mask, 0))
    layer_attributions_end = lc.attribute(inputs=input_embeddings, baselines=ref_input_embeddings,
                                          additional_forward_args=(token_type_ids, position_ids, attention_mask, 1))

    layer_attrs_start.append(summarize_attributions(layer_attributions_start[0]))
    layer_attrs_end.append(summarize_attributions(layer_attributions_end[0]))

    layer_attn_mat_start.append(layer_attributions_start[1])
    layer_attn_mat_end.append(layer_attributions_end[1])
# layer x seq_len
layer_attrs_start = torch.stack(layer_attrs_start)
# layer x seq_len
layer_attrs_end = torch.stack(layer_attrs_end)
# layer x batch x head x seq_len x seq_len
layer_attn_mat_start = torch.stack(layer_attn_mat_start)
# layer x batch x head x seq_len x seq_len
layer_attn_mat_end = torch.stack(layer_attn_mat_end)
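The loop above relies on helper functions defined earlier in the notebook (outside this excerpt), such as squad_pos_forward_func, construct_whole_bert_embeddings and summarize_attributions. For reference, a minimal sketch of summarize_attributions, assuming the usual convention of summing over the embedding dimension and L2-normalizing, looks like this:

# Sketch of the summarize_attributions helper used above (assumption: it follows the
# convention from the earlier part of the tutorial).
def summarize_attributions(attributions):
    # attributions: (1, seq_len, hidden_size) -> (seq_len,)
    attributions = attributions.sum(dim=-1).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    return attributions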
As a reminder of Part 1, we visualize the heatmaps of the attributions for the outputs of all 12 layers in the plots below. The outputs of the 12 layers are also known as the context layer, which represents the dot product between the attention matrices and the value vectors.
The plot below shows a heatmap of attributions across all layers and tokens for the start position prediction.
Note that here we do not have information about the different heads. Head-related information will be examined separately when we visualize the attribution scores of the attention matrices with respect to the start or end position predictions.
It is interesting to observe that the question word "what" gains increasingly high attribution from layer one to ten; in the last two layers that importance slowly diminishes. In contrast to the "what" token, many other tokens have negative or close-to-zero attribution in the first 6 layers. We start seeing slightly higher attribution for the tokens "important", "us" and "to". Interestingly, the token "important" is also assigned an attribution score that is remarkably high in the fifth and sixth layers. Lastly, our correctly predicted token "to" gains increasingly high positive attribution, especially in the last two layers.
fig, ax = plt.subplots(figsize=(15, 5))
xticklabels = all_tokens
yticklabels = list(range(1, 13))
ax = sns.heatmap(layer_attrs_start.cpu().detach().numpy(), xticklabels=xticklabels, yticklabels=yticklabels, linewidth=0.2)
plt.xlabel('Tokens')
plt.ylabel('Layers')
plt.show()
Now let's examine the heatmap of the attributions for the end position prediction. In the case of the end position prediction we again observe high attribution scores for the token "what" in the last 11 layers. The correctly predicted end token "kinds" has positive attribution across all layers, and it is especially prominent in the last two layers. It is also interesting to observe that the "humans" token has a relatively high attribution score in the last two layers.
fig, ax = plt.subplots(figsize=(15, 5))
xticklabels = all_tokens
yticklabels = list(range(1, 13))
ax = sns.heatmap(layer_attrs_end.cpu().detach().numpy(), xticklabels=xticklabels, yticklabels=yticklabels, linewidth=0.2)  # , annot=True
plt.xlabel('Tokens')
plt.ylabel('Layers')
plt.show()
It is interesting to note that when we compare the heatmaps for the start and end positions, the colors for the start position prediction are overall darker. This implies that fewer tokens attribute positively to the start position prediction, while more tokens act as negative indicators or signals for it.
Interpreting Attribution Scores for Attention Matrices
In this section we visualize the attribution scores of the start and end position predictions with respect to the attention matrices.
Note that each layer has 12 heads, hence 12 attention matrices per layer. We will first visualize a specific layer and head; later we will summarize across all heads in order to gain a bigger picture.
Below we visualize the attribution scores of the 12 heads of the selected layer (index stored in the variable layer) for the start position prediction.
visualize_token2token_scores(layer_attn_mat_start[layer].squeeze().cpu().detach().numpy())
As we can see from the visualizations above, in contrast to raw attention scores, the attributions of a specific target with respect to those scores are more meaningful; most importantly, they do not attend to the [SEP] token or show diagonal patterns. We observe that heads 4, 9, 12 and 2 show a strong relationship between the "what" and "it" tokens when predicting the start position, heads 10 and 11 between "it" and "it", head 8 between "important" and "to", and head 1 between "to" and "what". Note that the "to" token is the start position of the answer. It is also important to mention that these observations hold for the selected layer only; we can change the index of the selected layer and examine interesting relationships in other layers.
In the cell below we visualize the attention attribution scores normalized across the head axis.
visualize_token2token_scores(norm_fn(layer_attn_mat_start, dim=2).squeeze().detach().cpu().numpy(), x_label_name='Layer')
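The norm_fn helper is also defined earlier in the notebook. Assuming it is simply a vector/matrix norm over the given dimension, a reasonable stand-in is:

import torch

# Assumption: norm_fn is just a norm over the given dimension.
# torch.linalg.norm exists on PyTorch >= 1.7; fall back to torch.norm otherwise.
try:
    norm_fn = torch.linalg.norm
except AttributeError:
    norm_fn = torch.norm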
Looking at the visualizations above, we can see that the model pays attention to very specific, handpicked relationships when making a prediction for the start position. Most notably, in layers 10, 7, 11 and 4 it focuses more on the relationships between "it" and "is", and between "important" and "to".
Now let's run the same experiments for the end position prediction. Below we visualize the attribution scores of the attention matrices for the end position prediction for the selected layer.
visualize_token2token_scores(layer_attn_mat_end[layer].squeeze().cpu().detach().numpy())
As we can see from the visualizations above, for the end position prediction we have stronger attention towards the end token of the answer, "kinds". Here we can see a stronger connection between "humans" and "kinds" in the 11th head, and between "it" and the tokens "em", "power" and "and" in the 5th, 6th and 8th heads. The connections between "it" and "what" are also strong in the first couple of heads and in the 10th head.
Similar to the start position, let's visualize the norm across all heads for each layer.
visualize_token2token_scores(norm_fn(layer_attn_mat_end, dim=2).squeeze().detach().cpu().numpy(), x_label_name='Layer')
As we can see from the visualizations above, for the end position prediction there is a relation learnt between [SEP] and "." in the first and second layers. We also observe that the "it" token is strongly related to "what", "important" and "to".
Computing and Visualizing Vector Norms
In this section of the tutorial we will compute vector norms for the activation layers, such as ||f(x)||, ||α * f(x)|| and ||Σ α * f(x)||, as described in https://arxiv.org/pdf/2004.10102.pdf
As shown in the paper mentioned above, normalized activations are better indicators of importance than the attention scores, however they are not as indicative as the attribution scores. This is because the normalized activations ||f(x)|| and ||α * f(x)|| are not attributed to a specific output prediction. From our results we can also see that, according to those normalized scores, the [SEP] tokens are insignificant.
Below we define / extract all parameters that we need to compute the vector norms.
output_attentions_all_shape = output_attentions_all.shape

batch = output_attentions_all_shape[1]
num_heads = output_attentions_all_shape[2]
head_size = 64
all_head_size = 768
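The hard-coded sizes above correspond to bert-base. Assuming the loaded checkpoint follows the standard BERT configuration, they can equivalently be derived (and sanity-checked) from the model config:

# Sanity check (assumption: standard bert-base dimensions).
assert all_head_size == model.config.hidden_size                                    # 768
assert num_heads == model.config.num_attention_heads                                # 12
assert head_size == model.config.hidden_size // model.config.num_attention_heads    # 64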
In order to compute the norms mentioned above, we need access to the dense layer's weights and to the value vectors of the self-attention layers.
Getting Access to Value Activations
Let's define the list of all layers for which we would like to access the value activations.
layers = [model.bert.encoder.layer[layer].attention.self.value for layer in range(len(model.bert.encoder.layer))]
We use Captum's LayerActivation algorithm to access the outputs of all layers.
la = LayerActivation(squad_pos_forward_func, layers)

value_layer_acts = la.attribute(input_embeddings, additional_forward_args=(token_type_ids, position_ids, attention_mask))
# shape -> layer x batch x seq_len x all_head_size
value_layer_acts = torch.stack(value_layer_acts)
In the cell below we apply several transformations to the value layer activations to bring them into a shape that lets us compute the different norms. The transformations are done the same way as described in the original paper and the corresponding GitHub implementation.
new_x_shape = value_layer_acts.size()[:-1] + (num_heads, head_size)
value_layer_acts = value_layer_acts.view(*new_x_shape)

# layer x batch x num_heads x 1 x head_size
value_layer_acts = value_layer_acts.permute(0, 1, 3, 2, 4)
value_layer_acts = value_layer_acts.permute(0, 1, 3, 2, 4).contiguous()

value_layer_acts_shape = value_layer_acts.size()

# layer x batch x seq_length x num_heads x 1 x head_size
value_layer_acts = value_layer_acts.view(value_layer_acts_shape[:-1] + (1, value_layer_acts_shape[-1],))

print('value_layer_acts: ', value_layer_acts.shape)
value_layer_acts: torch.Size([12, 1, 26, 12, 1, 64])
Getting Access to Dense Features
Now let's transform the dense features so that we can use them to compute ||f(x)|| and ||α * f(x)||.
dense_acts = torch.stack([dlayer.attention.output.dense.weight for dlayer in model.bert.encoder.layer])

dense_acts = dense_acts.view(len(layers), all_head_size, num_heads, head_size)

# layer x num_heads x head_size x all_head_size
dense_acts = dense_acts.permute(0, 2, 3, 1).contiguous()
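As a quick sanity check (assuming bert-base dimensions), the dense weight of every layer is a square all_head_size x all_head_size matrix, and after the view/permute above dense_acts has shape (layer, num_heads, head_size, all_head_size):

# Shape sanity checks (assumption: bert-base, 12 layers, 12 heads, hidden size 768).
assert model.bert.encoder.layer[0].attention.output.dense.weight.shape == (all_head_size, all_head_size)
assert dense_acts.shape == (len(layers), num_heads, head_size, all_head_size)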
We compute the f(x) score by multiplying the value vectors with the weights of the dense layer, for all layers.
# layers, batch, seq_length, num_heads, 1, all_head_size
f_x = torch.stack([value_layer_acts_i.matmul(dense_acts_i) for value_layer_acts_i, dense_acts_i in zip(value_layer_acts, dense_acts)])
f_x.shape
torch.Size([12, 1, 26, 12, 1, 768])
# layer x batch x seq_length x num_heads x 1 x all_head_size
f_x_shape = f_x.size()
f_x = f_x.view(f_x_shape[:-2] + (f_x_shape[-1],))
f_x = f_x.permute(0, 1, 3, 2, 4).contiguous()

# (layers x batch, num_heads, seq_length, all_head_size)
f_x_shape = f_x.size()

# (layers x batch, num_heads, seq_length)
f_x_norm = norm_fn(f_x, dim=-1)
Now let's visualize the ||f(x)|| scores for all layers and examine the distribution of those scores.
visualize_token2head_scores(f_x_norm.squeeze().detach().cpu().numpy())
When we examine the ||f(x)|| scores for all layers, we can easily see that the [SEP] token receives the lowest score across all layers. This is one of the conclusions the original paper came to. As for other tokens, we can see that heads in different layers focus on different parts of the input sentence.
Now let's compute ||α * f(x)||. This computation is performed using the original paper's technique, with the help of the einsum operator.
# layer x batch x num_heads x seq_length x seq_length x all_head_size
alpha_f_x = torch.einsum('lbhks,lbhsd->lbhksd', output_attentions_all, f_x)

# layer x batch x num_heads x seq_length x seq_length
alpha_f_x_norm = norm_fn(alpha_f_x, dim=-1)
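To make the einsum explicit: no index is summed out in 'lbhks,lbhsd->lbhksd', so each output entry is the product alpha[l,b,h,k,s] * f_x[l,b,h,s,d], i.e. the attention weight from token k to token s scaled against the transformed value of token s. A toy-shape check (hypothetical small dimensions) illustrates this:

# Toy-shape check for the einsum above (hypothetical dims: 2 layers, 1 batch, 3 heads, 4 tokens, hidden 5).
alpha_toy = torch.rand(2, 1, 3, 4, 4)   # layer x batch x head x seq x seq
f_x_toy = torch.rand(2, 1, 3, 4, 5)     # layer x batch x head x seq x hidden
out = torch.einsum('lbhks,lbhsd->lbhksd', alpha_toy, f_x_toy)
print(out.shape)                        # torch.Size([2, 1, 3, 4, 4, 5])
assert torch.allclose(out[0, 0, 0, 1, 2], alpha_toy[0, 0, 0, 1, 2] * f_x_toy[0, 0, 0, 2])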
Let's now visualize the ||α * f(x)|| scores for the layer with index layer.
visualize_token2token_scores(alpha_f_x_norm[layer].squeeze().detach().cpu().numpy())
As we can see from the visualizations above, there is no strong attention to the [SEP] or [CLS] tokens. Some of the heads show diagonal patterns and some of them show strong attention between specific pairs of tokens.
Now let's compute the norm summed across the num_heads axis, ||Σ α * f(x)||, and visualize the normalized scores for each layer.
summed_alpha_f_x = alpha_f_x.sum(dim=2)

# layers x batch x seq_length x seq_length
summed_alpha_f_x_norm = norm_fn(summed_alpha_f_x, dim=-1)
visualize_token2token_scores(summed_alpha_f_x_norm.squeeze().cpu().detach().numpy(), x_label_name='Layer')
The visualizations above also confirm that the attention scores are not concentrated on tokens such as [CLS], [SEP] and "."; instead, we see stronger signals along the diagonals and some patches of stronger signals between certain parts of the text, including some tokens in the question that are relevant to the answer span.
It is important to mention that all experiments were performed on a single input sample, namely the variable sentence. In papers we often see results aggregated across multiple samples. For further analysis and more convincing conclusions, we recommend conducting the experiments across multiple input samples. In addition, it would also be interesting to look into the correlation of heads within a layer and across different layers.
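For instance, a hedged sketch of such an aggregation (examples and compute_layer_attrs_start are hypothetical names; the latter stands for a wrapper around the LayerConductance loop shown earlier, and equal sequence lengths are assumed) could average the summarized start-position attributions per layer:

# Hypothetical aggregation across several preprocessed samples (sketch, not part of the original notebook).
per_sample_attrs = []
for ex in examples:                               # examples: hypothetical list of preprocessed inputs
    layer_attrs = compute_layer_attrs_start(ex)   # hypothetical wrapper around the loop above -> (layer x seq_len)
    per_sample_attrs.append(layer_attrs)

# layer x seq_len, averaged over samples (assumes equal sequence lengths)
mean_layer_attrs_start = torch.stack(per_sample_attrs).mean(dim=0)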