
[Paper Summary] Attention Is All You Need


model architecture

encoder-decoder pairing in brief

transformer architecture in more detail

  • in each encoder and decoder layer, self-attention lets every position attend to all positions, in principle
  • to keep the decoder auto-regressive for sequence-to-sequence transduction, decoder self-attention masks all positions to the right of the current step
  • the output of the topmost encoder layer supplies the keys and values for the encoder-decoder attention in each decoder layer
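
The decoder-side masking can be pictured as a lower-triangular matrix of allowed connections. This is only a minimal sketch in plain Python; the helper name `causal_mask` is hypothetical and not taken from the paper.

```python
def causal_mask(n):
    """Return an n x n matrix; entry [i][j] is True if position i may attend to position j."""
    # Position i may see itself and everything to its left (j <= i),
    # never the positions to its right.
    return [[j <= i for j in range(n)] for i in range(n)]


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed connection, "." a masked one.
    print(" ".join("x" if allowed else "." for allowed in row))
```

In an actual model, the masked entries would typically be set to a very large negative value before the softmax so that their attention weights become zero.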

scaled dot-product and multi-head attention

scaled dot-product attention

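  • dot-product (multiplicative) attention is one of the two commonly used attention functions; the other is the additive attention introduced in "Neural Machine Translation by Jointly Learning to Align and Translate"
  • for large d_k the inner products grow large in magnitude and push the softmax into regions with very small gradients; the scaling factor 1/sqrt(d_k) counteracts this
  • in practice it is faster and more space-efficient than additive attention, and the scaling factor is the practical heuristic that keeps it competitive for large d_k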
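
As a rough illustration of the points above, scaled dot-product attention computes softmax(Q K^T / sqrt(d_k)) V. The sketch below uses plain Python lists instead of a tensor library; the function names and the toy inputs are assumptions made for this example.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """queries: n_q x d_k, keys: n_k x d_k, values: n_k x d_v (lists of lists)."""
    d_k = len(keys[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs = []
    for q in queries:
        # Scaled inner product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# Toy inputs (hypothetical): two positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

Each query ends up with most of its weight on the key it matches, so the first output row leans toward the first value vector and the second row toward the second.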

multi-head attention

  • it runs several ‘scaled dot-product attention’ heads in parallel, each on a different learned projection of the queries, keys and values, and combines their results
  • because each head works in a reduced dimension, the total compute cost is almost the same as that of single-head attention with full dimensionality
  • a good trade-off: the heads can attend to different representation subspaces without giving up computational efficiency
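
The cost claim can be checked with a back-of-the-envelope count of multiply-accumulate operations. This is a simplified sketch, not the paper's own accounting: it assumes d_k = d_v = d_model / h and ignores the softmax and the final output projection. The function name `attention_macs` and the sequence length are hypothetical; d_model = 512 with 8 heads matches the base model.

```python
def attention_macs(n, d_model, num_heads):
    """Rough multiply-accumulate count for one attention layer over n positions.

    Assumptions (for illustration only): each head uses
    d_k = d_v = d_model // num_heads; softmax and the output
    projection are not counted.
    """
    d_head = d_model // num_heads
    projections = num_heads * 3 * n * d_model * d_head  # Q, K, V projections per head
    scores = num_heads * n * n * d_head                  # Q K^T per head
    weighted_sum = num_heads * n * n * d_head            # attention weights times V per head
    return projections + scores + weighted_sum


single = attention_macs(n=50, d_model=512, num_heads=1)
multi = attention_macs(n=50, d_model=512, num_heads=8)
print(single, multi)
```

Under these assumptions the two counts coincide, because the per-head dimension shrinks by exactly the factor by which the number of heads grows.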

position-wise feed-forward networks

  • each layer in the encoder/decoder has a feed-forward network (two linear transformations with a ReLU in between) whose parameters are shared across positions but differ from layer to layer
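
A minimal sketch of what "position-wise" means here, using the form FFN(x) = max(0, x W1 + b1) W2 + b2. The toy weights and sizes below are made up for illustration and are far smaller than anything used in practice.

```python
def feed_forward(x, w1, b1, w2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to a single position vector x."""
    hidden = [
        max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hj * w2[j][k] for j, hj in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]


def position_wise(sequence, w1, b1, w2, b2):
    # The same weights are reused at every position of the sequence.
    return [feed_forward(x, w1, b1, w2, b2) for x in sequence]


# Toy sizes (hypothetical): model dimension 2, inner dimension 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
sequence = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.0]]
outputs = position_wise(sequence, w1, b1, w2, b2)
print(outputs)
```

Identical input vectors at different positions produce identical outputs, since nothing in the network depends on the position itself.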

embeddings and softmax

  • the same weight matrix is shared between the input and output embedding layers and the pre-softmax linear transformation
  • in the embedding layers, those weights are multiplied by sqrt(d_model)
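
The weight sharing can be sketched with a single matrix used in both directions. The names `embed` and `output_logits`, as well as the tiny 3-token vocabulary, are hypothetical and only serve the illustration.

```python
import math


def embed(token_id, shared):
    # Look up the token's row and scale it by sqrt(d_model).
    d_model = len(shared[0])
    return [math.sqrt(d_model) * w for w in shared[token_id]]


def output_logits(hidden, shared):
    # Pre-softmax linear transformation: one logit per vocabulary row,
    # computed with the same matrix that the embedding lookup uses.
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared]


# Toy shared matrix (hypothetical): vocabulary of 3 tokens, d_model = 4.
shared = [
    [0.1, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.4],
    [0.5, 0.0, 0.0, 0.5],
]
vector = embed(1, shared)
logits = output_logits([1.0, 0.0, 0.0, 1.0], shared)
print(vector, logits)
```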

positional encoding

  • sinusoidal functions inject the sequence-order information; they were chosen on the hypothesis that they make it easy for the model to learn to attend by relative positions
  • for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}
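
The linearity claim can be checked numerically. With PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), each sin/cos pair of PE_{pos+k} is a rotation of the matching pair of PE_{pos} by an angle that depends only on k. The helper names and the toy values for d_model, pos and k below are assumptions for this sketch.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    pe = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, i.e. "2i" in the paper's notation.
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe, k, d_model):
    """Compute PE(pos + k) from PE(pos) with a linear map that depends only on k."""
    shifted = []
    for i in range(0, d_model, 2):
        omega = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        # Angle-addition identities, written as a 2x2 rotation of (s, c).
        shifted.append(s * math.cos(omega * k) + c * math.sin(omega * k))
        shifted.append(c * math.cos(omega * k) - s * math.sin(omega * k))
    return shifted


d_model, pos, k = 8, 5, 3
direct = positional_encoding(pos + k, d_model)
via_linear_map = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, via_linear_map))
print(max_error)
```

The two results agree up to floating-point rounding, which is the sense in which one encoding is a linear function of the other.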

why self-attention

  • complexity per layer
    • a self-attention layer, O(n^2 · d), is cheaper than a recurrent layer, O(n · d^2), when the sequence length n is smaller than the representation dimension d, which is most often the case
    • a self-attention layer is also much cheaper than a convolutional layer, O(k · n · d^2); separable convolutions reduce this to O(k · n · d + n · d^2), which even at k = n only equals the cost of a self-attention layer plus a point-wise feed-forward layer
    • when n is very large, self-attention can be restricted to a neighborhood of size r around each position, but this increases the maximum path length to O(n/r)
  • sequential operations
    • self-attention, convolutional and restricted self-attention layers all need only a constant number, O(1), of sequential operations, while a recurrent layer needs O(n)
  • maximum path length
    • a single self-attention layer connects all positions to each other in O(1) steps, which none of the other layer types compared here can do
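
To make the comparison concrete, the asymptotic expressions for the four layer types can be evaluated for sample sizes. The values of n, d, k and r below are arbitrary choices for this sketch, and the numbers are orders of magnitude only, not measured costs.

```python
import math


def layer_costs(n, d, k, r):
    """Return (complexity per layer, sequential operations, maximum path length).

    n: sequence length, d: representation dimension,
    k: convolution kernel width, r: neighborhood size for restricted self-attention.
    Constant factors are dropped, so only the relative sizes are meaningful.
    """
    return {
        "self-attention": (n * n * d, 1, 1),
        "recurrent": (n * d * d, n, n),
        "convolutional": (k * n * d * d, 1, math.log(n, k)),
        "restricted self-attention": (r * n * d, 1, n / r),
    }


costs = layer_costs(n=50, d=512, k=3, r=10)
for name, (complexity, sequential, path) in costs.items():
    print(f"{name:26s} complexity={complexity:>12d} sequential={sequential:>3d} path={path:.1f}")
```

With n much smaller than d, as in this example, self-attention has the lowest per-layer complexity among the unrestricted layer types while also keeping both the sequential operations and the path length constant.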


training

  • training data and batching
    • WMT 2014 English-German
      • 4.5 million sentence pairs
      • shared source/target vocab of about 37000 tokens with BPE
    • WMT 2014 English-French
      • 36 million sentence pairs
      • shared source/target vocab of about 32000 tokens with word-piece
    • sentence pairs batched together by approximate sequence length
      • each training batch held a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens
  • hardware and schedule
    • one machine with 8 P100 GPUs
    • base model
      • training step time: 0.4 seconds
      • training for 100000 steps (12 hours)
    • big model
      • training step time: 1.0 seconds
      • training for 300000 steps (3.5 days)
  • optimizer
    • Adam optimizer with beta1 = 0.9, beta2 = 0.98, epsilon = 10^-9
    • learning rate varied over the course of training, with warmup_steps = 4000:
      • lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))
  • regularization
    • residual dropout
      • dropout applied in the encoder/decoder stacks to:
        • multi-head attention output before ‘Add & Norm’
        • FFN output before ‘Add & Norm’
        • input/output embeddings + positional encodings
      • dropout probability for the base model: 0.1
    • label smoothing
      • epsilon = 0.1
      • performance: improves accuracy and BLEU but hurts perplexity
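
The learning-rate schedule from the optimizer item can be sketched as a small function: the rate increases linearly for the first warmup_steps training steps and then decays proportionally to the inverse square root of the step number. The function name is hypothetical, and the default d_model = 512 is the base model's dimension.

```python
def learning_rate(step, d_model=512, warmup_steps=4000):
    """Warmup-then-decay schedule; step is assumed to start at 1."""
    # During warmup the second term is smaller (linear increase);
    # afterwards the first term is smaller (inverse square-root decay).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 10000, 100000):
    print(step, learning_rate(step))
```

The rate peaks at step = warmup_steps, where the two terms inside `min` are equal.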



New Wolfram Programming Language

This goes way beyond the practical limits that no other existing programming language has been able to escape. I really want to give it a try and find out what is possible with this level-shifting invention, the result of Stephen Wolfram's 30 years of work.

Ryan Swanstrom

Stephen Wolfram, founder of Wolfram Research and creator of Mathematica, just announced the new Wolfram Programming Language. This is really exciting and cool, so please take some time to watch the video. I think this might be a game changer in data science.
