[Pre-trained Language Models Series] A Brief Look at the Transformer Through HuggingFace Code

        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/><div>
<p style="text-align:center">
    <img src="https://img0.tuicool.com/fieYjqq.jpg!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>

    <span>This is the ninth article in the pre-trained language models series.</span>




            <span>Quick links</span>

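1-4: [The Budding Era], [Rising Winds and Clouds], [General Techniques for Text Classification], [The GPT Family]; 5-8: [BERT Arrives], [A Brief Look at the BERT Code], [ERNIE Collection], [MT-DNN (KD)]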

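    <span>Thanks to the Natural Language Processing Lab at Tsinghua University for their overview of</span>
    <strong>pre-trained language model architectures</strong>
    <span>. We follow that map to explore the latest techniques in pre-trained language models. Articles already covered are marked with red boxes. This installment walks through the Transformer with the help of HuggingFace code; comments and discussion are welcome.</span>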

<p style="text-align:center">
    <img src="https://img2.tuicool.com/BVbuEn6.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>

    1

        <span>
            <strong>Preface</strong>
        </span>


    <span>
        <span>

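The previous installments covered many Transformer-based pre-trained language models. This time, let's review the Transformer itself alongside code. It is the core module of today's state-of-the-art language models and has replaced the RNN as the cornerstone of NLP.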

    <span>
        <span>Along the way I quote code from the HuggingFace Transformers package, mainly the code related to BertAttention. I hope you find it useful.</span>
    </span>


    2
    Attention Is All You Need (2017)

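    <span>Previously, the key building blocks for sequential NLP data were recurrent neural networks (RNNs, e.g. LSTMs) or convolutional neural networks (CNNs), and each has its problems. An RNN cannot be parallelized across time steps, so training is slow, and gradients propagate poorly and tend to explode or vanish. A CNN has difficulty capturing long-range semantics. This paper therefore proposes a new, simple architecture called the Transformer. It is based solely on attention, can be computed in parallel for faster training, and can also capture relationships within long sequences.</span>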


    <span>
        <span>Here is the architecture diagram:</span>
    </span>

<p style="text-align:center">
    <img src="https://img2.tuicool.com/zqe26ze.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>
<h4>Encoder-decoder</h4>

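    <span>First, it is an encoder-decoder structure. By design, the inputs on the left are encoded into a vector representation z, which is fed to the decoder on the right (the arrow leaving the top of the encoder). During decoding, the decoder combines z with the output tokens that precede the current position to generate the current token, which makes it a typical autoregressive model. Let's look at the encoder and the decoder in turn.</span>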

<ul>
    <li>

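            Encoder:

The Transformer encoder has six layers, each made up of the two sublayers shown in the figure. The first sublayer is Multi-Head Attention and the second is a feed-forward layer. Each of the two sublayers is wrapped in a residual connection followed by layer normalization.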

            Decoder:

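The decoder also has six layers, with two changes relative to the encoder. First, the multi-head attention in the first sublayer must be masked: in an autoregressive model, when a word is generated during decoding the words after it do not exist yet, so attention may only look at earlier words and later positions must be masked out. Second, a new sublayer is inserted in the middle to compute attention between the vectors coming from the encoder and the decoder's own output-side vectors.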
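To make the masking in the decoder's first sublayer concrete, here is a minimal NumPy sketch. It is not the HuggingFace implementation: the helper names `causal_mask` and `softmax` and the random scores are assumptions made for illustration. It builds an additive mask that puts negative infinity on every future position, so that after the softmax each token only attends to itself and earlier tokens.

```python
import numpy as np

def causal_mask(length):
    """Additive mask: 0 where attention is allowed, -inf on future positions."""
    # Entries strictly above the diagonal correspond to tokens that come later.
    future = np.triu(np.ones((length, length), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))          # hypothetical raw scores for 4 tokens
probs = softmax(scores + causal_mask(4))  # row i only attends to positions <= i
```

Every row of `probs` still sums to 1, but all entries above the diagonal are exactly zero, so no information flows from later tokens to earlier ones.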

Attention

    <span>

Attention is the essence of the Transformer. Next, we use the paper and the code to introduce the basic concepts and usage of attention.

<p style="text-align:center">
    <img src="https://img0.tuicool.com/iiuEFjY.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>

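    <span>The left side of the figure above is called</span>
    <strong>Scaled Dot-Product Attention</strong>
    <span>.</span>
    <span>Its formula is shown in the figure below. Q is multiplied by the transpose of K and the result is divided by sqrt(dk), where dk is the dimension of a single head, i.e. attention_head_size in the code later in this section. After a softmax, the result is matrix-multiplied by V to give the attention output.</span>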

<p style="text-align:center">
    <img src="https://img2.tuicool.com/UvM3I3M.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>
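For readers who cannot load the image, the formula from the paper can be written out as follows, with d_k denoting the per-head dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```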

    <span>
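        <span>Here we use the code of the BertSelfAttention class in the Transformers package for a concrete discussion. First, three projection matrices are defined: query, key, and value. Their parameter counts depend on hidden_size and on the number of heads; we come back to heads later. L denotes the text length.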

<pre class="prettyprint"><span><span># Number of heads and the size of each head</span></span>
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
# The three projection matrices
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)

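The three matrices defined here transform the hidden_states of a piece of text into its Q, K, and V representations. The code below shows the actual transformation and computation.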

# Reshape x from (batch_size, L, all_head_size) to (batch_size, num_heads, L, attention_head_size)
def transpose_for_scores(self, x):
    new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    x = x.view(*new_x_shape)
    return x.permute(0, 2, 1, 3)

# Apply the three projections to hidden_states to form Q, K, V
mixed_query_layer = self.query(hidden_states)
mixed_key_layer = self.key(hidden_states)
mixed_value_layer = self.value(hidden_states)

query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)

# Multiply Q by the transpose of K to get the self-attention scores of the text,
# then scale by sqrt(attention_head_size)
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
    attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)

# Mask heads if we want to
if head_mask is not None:
    attention_probs = attention_probs * head_mask

# Multiply the softmaxed attention_probs by V to get the attention output
context_layer = torch.matmul(attention_probs, value_layer)

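From the code above, we can observe the following:

hidden_states has shape (batch_size, L, hidden_size). It is the vector representation produced for the text, such as the embeddings.
Applying the query projection to hidden_states gives mixed_query_layer with shape (batch_size, L, all_head_size).
After the reshape and transpose it becomes query_layer with shape (batch_size, num_heads, L, attention_head_size), which is Q in the figure above.
Likewise, key_layer (K in the figure above) has shape (batch_size, num_heads, L, attention_head_size). The matrix multiplication acts on the last two dimensions and contracts attention_head_size, so attention_scores has shape (batch_size, num_heads, L, L).
After the softmax, attention_scores is normalized along the length dimension, and multiplying by value gives context_layer with shape (batch_size, num_heads, L, attention_head_size).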
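The shape bookkeeping in the list above can be traced with a small NumPy sketch. This is a stand-in for the PyTorch ops, not the library code: the sizes are arbitrary, the random weights `w_q`, `w_k`, `w_v` play the role of self.query, self.key, and self.value with biases omitted, and dropout and masking are left out. The final `merged` step, which folds the heads back into one vector per token, is not part of the excerpt above and is included here as the usual follow-up.

```python
import numpy as np

batch_size, L, hidden_size, num_heads = 2, 5, 12, 3   # illustrative sizes
head_size = hidden_size // num_heads                  # attention_head_size
all_head_size = num_heads * head_size

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(batch_size, L, hidden_size))
# Weights standing in for self.query / self.key / self.value (biases omitted).
w_q, w_k, w_v = (rng.normal(size=(hidden_size, all_head_size)) for _ in range(3))

def transpose_for_scores(x):
    # (batch, L, all_head_size) -> (batch, num_heads, L, head_size)
    return x.reshape(batch_size, L, num_heads, head_size).transpose(0, 2, 1, 3)

query_layer = transpose_for_scores(hidden_states @ w_q)
key_layer = transpose_for_scores(hidden_states @ w_k)
value_layer = transpose_for_scores(hidden_states @ w_v)

# (batch, heads, L, head_size) @ (batch, heads, head_size, L) -> (batch, heads, L, L)
scores = query_layer @ key_layer.transpose(0, 1, 3, 2) / np.sqrt(head_size)
scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
attention_probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context_layer = attention_probs @ value_layer          # (batch, heads, L, head_size)

# Merge the heads back: (batch, heads, L, head_size) -> (batch, L, all_head_size)
merged = context_layer.transpose(0, 2, 1, 3).reshape(batch_size, L, all_head_size)
```

Printing the `.shape` of each intermediate array reproduces the shapes listed above, with `merged` returning to the same layout as `hidden_states`.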

    <span>

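So Q (query), K (key), and V (value) are all obtained by multiplying the original text vectors by the corresponding matrix, and the parameters of these three projection matrices are learned during training. Multiplying Q by the transpose of K yields a (batch_size, num_heads, L, L) matrix. My reading is that every position in the text is multiplied with every other position to produce a number, which we can regard as the relatedness between each token and every other token. This is the self-attention score, and in general the larger it is, the more closely the two tokens are related. After the softmax each score becomes a number between 0 and 1, and matrix-multiplying these scores by value gives a context-dependent attention representation.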
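A toy example may help here. The numbers below are made up, and the constant -10000.0 is an assumed "large negative" value for the additive mask rather than something the excerpt specifies. The sketch shows two things: each row of softmaxed scores sums to 1, so every context vector is a weighted average of the value rows, and adding a large negative number to a masked position before the softmax drives its weight to zero, which is what the `attention_scores + attention_mask` line in the excerpt relies on.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical scores for a 3-token text; the third token is padding.
scores = np.array([[2.0, 1.0, 0.5],
                   [0.3, 1.5, 0.2],
                   [0.1, 0.4, 0.9]])
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0]])

keep = np.array([1.0, 1.0, 0.0])          # 1 = real token, 0 = padding
additive_mask = (1.0 - keep) * -10000.0   # assumed large negative constant

probs = softmax(scores + additive_mask)   # the mask broadcasts over the rows
context = probs @ values                  # each row: weighted average of values
```

Because the padding token receives zero weight, its value row (5, 5) never leaks into `context`, and every row of `context` is a mix of the first two value rows only.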

    <span>
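        <span>Earlier we also mentioned</span>
        <strong>Multi-Head Attention</strong>
        <span>. Compared with applying the attention function once, applying it h times works better. In other words, we would initialize h different query, key, and value matrices, each of size (hidden_size, attention_head_size). In the code above we actually initialize a single matrix of size (hidden_size, attention_head_size * num_heads), which is equivalent to num_heads (h) single-head attention matrices: its parameters are simply num_heads blocks, one per head, laid side by side. Each block of parameters performs the attention computation above, and the per-head results are concatenated to form the final multi-head attention.</span>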
    </span>
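The equivalence between one fused matrix and h per-head matrices can be checked numerically. The NumPy sketch below uses assumed toy sizes and stores the weight as (hidden_size, all_head_size) for readability, which is a simplification of how nn.Linear lays out its weight, and it omits the bias. It compares projecting once and splitting into heads against projecting with each head's own column slice.

```python
import numpy as np

hidden_size, num_heads = 8, 2
head_size = hidden_size // num_heads
rng = np.random.default_rng(0)

x = rng.normal(size=(3, hidden_size))     # 3 tokens, batch dimension omitted
# One fused weight of shape (hidden_size, num_heads * head_size).
w_fused = rng.normal(size=(hidden_size, num_heads * head_size))

# View 1: project once, then split the last dimension into heads.
fused = (x @ w_fused).reshape(3, num_heads, head_size)

# View 2: slice out each head's (hidden_size, head_size) weight and project separately.
per_head = np.stack(
    [x @ w_fused[:, h * head_size:(h + 1) * head_size] for h in range(num_heads)],
    axis=1,
)
same = np.allclose(fused, per_head)
```

Both views give the same numbers, which is why a single large projection followed by a reshape is enough to implement all heads at once.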

<h4>Positional Encoding</h4>

    <span>
        <span>

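The attention mechanism lets the Transformer build long-range semantic associations. Note, however, that the encoder and decoder contain no recurrence or convolution: the fully connected layers process the token at each position independently, and attention by itself has no notion of token order, so word-order information would be lost. It is therefore necessary to tell the network the relative or absolute position of the tokens, which is why the paper introduces Positional Encoding.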

<p style="text-align:center">
    <img src="https://img0.tuicool.com/fmy2uqR.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>

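    <span>Here pos is the position and i indexes the embedding dimension: 2i refers to the even embedding dimensions and 2i+1 to the odd ones. i runs from 0 to dmodel / 2, so across the different embedding dimensions the wavelengths of the sin and cos functions range from 2pi to 20000 pi. For a fixed i, varying pos makes that embedding component follow a sine or cosine curve. With this encoding, the model can learn to model relative positions.</span>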
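As a minimal sketch of the sinusoidal scheme just described, assuming an even d_model and using a hypothetical helper name `sinusoidal_positions`, the encoding can be built in a few lines of NumPy. Even dimensions take the sine and odd dimensions take the cosine of pos / 10000^(2i / d_model).

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) array; assumes d_model is even."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positions(max_len=50, d_model=16)
```

The resulting array has one row per position and is simply added to the token embeddings. Low dimensions oscillate quickly with pos while high dimensions change very slowly, which matches the wavelength range mentioned above.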

<h4>Visualizations</h4>

    <span>

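Another benefit of attention is better visualization and interpretability. In the figure below, the word making is mainly linked to the words around it and especially strongly to more difficult, which is quite reasonable. We can also see that different colors represent different heads and that the heads produce quite different results, which is why multi-head attention brings gains.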

<p style="text-align:center">
    <img src="https://img0.tuicool.com/jM3QV3M.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>
From the next figure, we can see that attention largely captures the structure of the sentence.
<p style="text-align:center">
    <img src="https://img0.tuicool.com/Ire2aav.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>

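    <span>Finally, the authors compare the Transformer across several tasks. I only show the machine translation results here, which indicate that the Transformer does improve on both quality and efficiency.</span>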

<p style="text-align:center">
    <img src="https://img2.tuicool.com/FBjQFfY.png!web" class="alignCenter" referrerpolicy="no-referrer"/>
</p>

    <span>
        <strong mpa-from-tpl="t" mpa-is-content="t">To be continued</strong>
    </span>


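    <span md-inline="plain">That wraps up this installment's paper. Thank you for reading and for your support. Next time we will introduce other pre-trained language models, so stay tuned!</span>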


