[Paper Summary] Attention Is All You Need


model architecture

encoder-decoder pairing in brief

transformer architecture in more details

  • in each encoder and decoder layer, self-attention enables each position to attend to all positions, in principle
  • the decoder layers implement auto-regressive seq-to-seq transduction by masking all positions to the right of the current step
  • the topmost encoder output serves as the keys & values entering each decoder layer's encoder-decoder attention
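The decoder's rightward masking mentioned above can be sketched as follows (a minimal NumPy illustration; the function name is my own, not from the paper):

```python
import numpy as np

def causal_mask(n):
    """Boolean mask: position i may attend only to positions j <= i.
    In practice the masked-out scores are set to -inf before the softmax."""
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)
# row i = query position, column j = key position;
# everything to the right of the diagonal is blocked
```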

scaled dot-product and multi-head attention

scaled dot-product attention

  • the attention method in common use since “Neural Machine Translation by Jointly Learning to Align and Translate”
  • it handles the magnitude problem of inner products between large vectors by introducing a scaling factor (1/sqrt(dk))
  • computationally cheaper than additive attention, with the scaling factor applied as a practical heuristic
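The formula being summarized is Attention(Q, K, V) = softmax(Q·Kᵀ/√dk)·V. A minimal NumPy sketch (shapes are my own illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # 1/sqrt(d_k) keeps dot products in a range where softmax still has gradient
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Q = np.random.randn(3, 8)   # 3 query positions, d_k = 8
K = np.random.randn(5, 8)   # 5 key positions
V = np.random.randn(5, 8)
out, w = scaled_dot_product_attention(Q, K, V)
```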

multi-head attention

  • it produces multiple attention results from different parallel ‘scaled dot-product attention’ heads
  • its compute cost is almost the same as that of a single-head attention with the full dimensionality
  • a brilliant choice for both model quality and computational efficiency
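The "same compute as one full-dimensional head" point follows because each of the h heads works in d_model/h dimensions. A rough sketch (random matrices stand in for the learned projections W_Q, W_K, W_V, W_O, so this is illustration only):

```python
import numpy as np

def multi_head_attention(X, h=4):
    """Minimal sketch: h parallel scaled dot-product heads, concatenated
    and projected back to d_model. Projections here are random stand-ins
    for the learned ones."""
    n, d_model = X.shape
    d_k = d_model // h          # each head runs in reduced dimensionality,
    rng = np.random.default_rng(0)  # so total cost ~ one full-dim head
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)
    Wo = rng.standard_normal((h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

X = np.random.randn(6, 32)   # 6 positions, d_model = 32
out = multi_head_attention(X)
```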

position-wise feed-forward networks

  • each layer in the encoder/decoder has a feed-forward network whose parameters are shared across positions but differ from layer to layer
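The FFN is FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied identically at every position. A small sketch (the paper uses d_model = 512 and d_ff = 2048; tiny sizes here for illustration):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position
    independently with the same W1/b1/W2/b2 (one set per layer)."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

d_model, d_ff = 16, 64
W1 = np.random.randn(d_model, d_ff)
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model)
b2 = np.zeros(d_model)
X = np.random.randn(5, d_model)     # 5 positions
out = position_wise_ffn(X, W1, b1, W2, b2)
```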

embeddings and softmax

  • the same weight matrix is shared between the input/output embedding layers and the pre-softmax linear transformation
  • in the embedding layers, those weights are multiplied by sqrt(dmodel)
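The weight tying can be sketched like this: one matrix E serves as embedding table (scaled by √dmodel on lookup) and, transposed, as the pre-softmax projection (vocab size and dimensions here are my own toy values):

```python
import numpy as np

vocab, d_model = 100, 16
E = np.random.randn(vocab, d_model)   # one matrix, shared across all three roles

def embed(token_ids):
    # input/output embedding: lookup, then scale by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def logits(hidden):
    # pre-softmax linear transformation reuses the same matrix, transposed
    return hidden @ E.T

x = embed(np.array([3, 7, 42]))
y = logits(x)
```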

positional encoding

  • sinusoidal functions are used to inject the sequence order information to help the model learn to attend by relative positions
  • for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)
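The sinusoids in question are PE(pos, 2i) = sin(pos/10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos/10000^(2i/dmodel)); a short sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dims: sine
    pe[:, 1::2] = np.cos(angle)                # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)
```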

why self-attention

  • complexity per layer
    • a self-attention layer has lower computational complexity than a recurrent layer when n is less than d, which is most often the case
    • a self-attention layer has much lower complexity than a conv layer (separable convolutions reduce this to a complexity equal to the combination of a self-attention layer and a point-wise FFN)
    • when n is very large, restricted self-attention could be an alternative, but it increases the maximum path length to O(n/r)
  • sequential operations
    • a self-attention layer needs only a constant number of sequential operations, as do conv and restricted self-attention layers, while a recurrent layer does the worst at O(n)
  • maximum path length
    • a single self-attention layer simply connects all positions to each other which no other layer type can do
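The per-layer terms from the paper's comparison table can be made concrete with typical values (n = sequence length, d = representation dim, k = conv kernel size, r = restricted-attention window; the example numbers are my own):

```python
# Per-layer complexity terms: self-attention O(n^2 d), recurrent O(n d^2),
# convolutional O(k n d^2), restricted self-attention O(r n d).
def per_layer_ops(n, d, k=3, r=16):
    return {
        "self-attention":            n * n * d,
        "recurrent":                 n * d * d,
        "convolutional":             k * n * d * d,
        "restricted self-attention": r * n * d,
    }

# A typical MT setting has n < d, so self-attention is cheapest per layer.
ops = per_layer_ops(n=50, d=512)
```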


training

  • training data and batching
    • WMT 2014 English-German
      • 4.5 million sentence pairs
      • shared source/target vocab of about 37000 tokens with BPE
    • WMT 2014 English-French
      • 36 million sentence pairs
      • shared source/target vocab of about 32000 tokens with word-piece
    • sentence-pair batching of approximate sequence length
      • a set of sentence pairs with approximately 25000 source/target tokens respectively
  • hardware and schedule
    • one machine with 8 P100 GPUs
    • base model
      • training step time: 0.4 seconds
      • training for 100000 steps (12 hours)
    • big model
      • training step time: 1.0 seconds
      • training for 300000 steps (3.5 days)
  • optimizer
    • Adam optimizer with beta1 = 0.9, beta2 = 0.98, epsilon = 10^-9
    • learning rate varying over the course of training with warmup_steps = 4000: lrate = dmodel^-0.5 · min(step^-0.5, step · warmup_steps^-1.5)
  • regularization
    • residual dropout
      • dropout applied in the encoder/decoder stacks to:
        • multi-head attention output before ‘Add & Norm’
        • FFN output before ‘Add & Norm’
        • input/output embeddings + positional encodings
      • dropout probability for the base model: 0.1
    • label smoothing
      • epsilon = 0.1
      • performance: better for accuracy and BLEU, while worse for perplexity
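The paper's warmup-then-decay learning-rate schedule can be sketched directly (d_model = 512 and warmup_steps = 4000 are the paper's base settings):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps steps, then decay
    proportional to 1/sqrt(step)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = lrate(4000)   # the maximum is reached exactly at warmup_steps
```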



[Book] Probabilistic Machine Learning

A brand-new ML book series by Kevin Murphy, renewing his previous textbook (Machine Learning: A Probabilistic Perspective).
The first of the two is available online as a PDF, and the second is to come in 2022.
Two books to rule them all. 🙂

* book website
* Probabilistic Machine Learning: An Introduction
* Probabilistic Machine Learning: Advanced Topics
* repo with Python 3 code for “Probabilistic Machine Learning”

Statement of accomplishment for Coursera Machine Learning class

Finally acquired the statement of accomplishment for Machine Learning on Coursera by Andrew Ng. It actually took me more than two years since the class first opened in 2012. I don’t wanna make excuses for such procrastination, but, as others would, I had some difficulty setting aside time just for the class. Yeah, shame on me, but I finished the course with a record of 100%. 🙂

Anyway, I’m personally proud of what I’ve done over the last couple of months. And deep appreciation to Andrew Ng for this encouraging class.

Statement of accomplishment: Coursera ml 2014

the one post on the curse of dimensionality

I just found a blog on computer vision and machine learning, and it gave me a chance to gain a clear, fundamental understanding of the curse of dimensionality.

The post above provides a very thorough and fundamental explanation of the essence of the curse of dimensionality. The other posts on that blog also seem to be must-read articles for ML newbies, just like me. 🙂

It would be well worth reading all the material on that blog, and that’s what I’m gonna do over the next week.

plotly – to graph and share your data

Analyze and visualize data, together.

A sorta fresh, brand-new social-network-like service for data scientists and people interested in data science: plotly. I just signed up out of pure curiosity, and now I think it’s definitely worth a go for data geeks. I’m planning to make my own life-logging record with a few gadgets like an Arduino; specifically, the Arduino API for streaming data from hardware devices, as described on the front page of their site, is a really appealing feature. Better pay visits more often.

Undergraduate ML course at UBC 2012

I happened to find these undergraduate ML course videos posted on YouTube by UBC. I really have to thank UBC & Professor Nando de Freitas for this valuable sharing. Anyone with a strong interest in ML should definitely take a look through the whole series. That’s what I’m gonna do over the next few months.

New Wolfram Programming Language

It goes simply way beyond the boundaries that all other existing programming languages couldn’t escape. I really wanna give it a try and find out what’d be possible with this level-shifting invention, the product of 30 years’ endeavor by Stephen Wolfram.

Ryan Swanstrom:

Stephen Wolfram, founder of Wolfram Research and creator of Mathematica, just announced the new Wolfram Programming Language. This is really exciting and cool, so please take some time to watch the video. I think this might be a game changer in data science.